NFS and kernel 2.6.x

public inbox for linux-kernel@vger.kernel.org

* NFS and kernel 2.6.x
@ 2004-04-16  1:14 Charles Shannon Hendrix
  2004-04-16  1:31 ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Charles Shannon Hendrix @ 2004-04-16  1:14 UTC (permalink / raw)
  To: Linux Kernel

I'm having a hard time right now with NFS on kernel 2.6.

I tried to search archives but can't find much on my exact problem.  If
I missed something good, a pointer would be great.

Anyway, the problem: NFS writes are broken in 2.6 on my machine.

I normally mount several volumes from a Sun SS5 running NetBSD.

It's worked great for years, and usually is not too bad on speed.

When I moved to Linux kernel 2.6.1, writes to the NetBSD server got
incredibly slow.  Throughput went from around 600K/sec to somewhere
between a few K/sec and maybe 25K/sec.

By contrast, rsync runs at around 900K/sec or faster, close to wire
speed (yes, raw speed, not compressed speed).

With kernels 2.6.3 and 2.6.5, it doesn't work at all.  If I do something
like this:

% cp bigfile /public

It just hangs.  After that, umounts or even reads of that volume hang.
They can be killed, but not always.  Gnome's Nautilus for example gets
permanently hung, though that might be its own issue.

Offhand, I cannot remember what NFS write performance was with Linux
kernel 2.4, but it was several hundred K/sec unless the server was
loaded.

Reading from the NFS server seems to still be fine.  For example, just
now I copied a file from there at around 660K/sec using kernel 2.6.5
on the client.

Anyway, I would like to explore this further and solve the problem.

Details on my setup:

NFS server:

    Sun SS5
    10baseT ethernet (100baseT card available, not used)
    NetBSD 1.6.1
    pretty much a plain vanilla server setup

Network:

    simple LAN with three machines, connected via a full duplex
    multi-speed switch

NFS client:

    vanilla PC
    Intel Pro/100 ethernet
    Slackware 9.1
    Linux kernel 2.6.5, plain with no mods or patches, only enough
	drivers and features enabled to run my workstation
	configuration as close as I could get to my Linux 2.4
	kernel

-- 
shannon "AT" widomaker.com -- ["All of us get lost in the darkness,
dreamers turn to look at the stars" -- Rush ]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16  1:14 NFS and kernel 2.6.x Charles Shannon Hendrix
@ 2004-04-16  1:31 ` Trond Myklebust
  2004-04-16  1:53   ` Andrew Morton
       [not found]   ` <20040416190126.GB408@widomaker.com>
  0 siblings, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16  1:31 UTC (permalink / raw)
  To: Charles Shannon Hendrix; +Cc: Linux Kernel

On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> 

> NFS server:
> 
>     Sun SS5
>     10baseT ethernet (100baseT card available, not used)
>     NetBSD 1.6.1
>     pretty much a plain vanilla server setup
> 
> Network:
> 
>     simple LAN with three machines, connected via a full duplex
>     multi-speed switch
> 
> NFS client:
> 
>     vanilla PC
>     Intel Pro/100 ethernet
>     Slackware 9.1
>     Linux kernel 2.6.5, plain with no mods or patches, only enough
> 	drivers and features enabled to run my workstation
> 	configuration as close as I could get to my Linux 2.4
> 	kernel

This is pretty much covered in the NFS FAQ entry B10.

You are experiencing the classical effects of using unreliable transport
(i.e. UDP) on a mixed speed network. Writes to the server are getting
lost, because it is on a slow segment that cannot keep up with the
faster 100Mbit clients.

Use the 'proto=tcp' mount option, and all will be well again.

Cheers,
  Trond
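
As a rough illustration of the advice above, the mount invocation can be
composed as in the sketch below. The server name "ss5" and the export
path are hypothetical; only the /public mount point and the 'proto=tcp'
option come from the thread. The helper only builds and prints the
command line, it does not run mount.

```python
import shlex


def nfs_mount_argv(server, export, mountpoint, options=()):
    """Build the argument list for a mount(8) call on an NFS export."""
    argv = ["mount", "-t", "nfs"]
    if options:
        # NFS mount options are passed as one comma-separated -o argument.
        argv += ["-o", ",".join(options)]
    argv += [f"{server}:{export}", mountpoint]
    return argv


# Hypothetical server name and export path; /public is the mount point
# used in the original report.
argv = nfs_mount_argv("ss5", "/export/public", "/public", ["proto=tcp"])
print(shlex.join(argv))
```

The same option can equally go into the options column of an fstab
entry for the volume.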

* Re: NFS and kernel 2.6.x
  2004-04-16  1:31 ` Trond Myklebust
@ 2004-04-16  1:53   ` Andrew Morton
  2004-04-16  2:54     ` Trond Myklebust
  2004-04-16  9:03     ` Jamie Lokier
       [not found]   ` <20040416190126.GB408@widomaker.com>
  1 sibling, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2004-04-16  1:53 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: shannon, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> > 
> 
> > NFS server:
> > 
> >     Sun SS5
> >     10baseT ethernet (100baseT card available, not used)
> >     NetBSD 1.6.1
> >     pretty much a plain vanilla server setup
> > 
> > Network:
> > 
> >     simple LAN with three machines, connected via a full duplex
> >     multi-speed switch
> > 
> > NFS client:
> > 
> >     vanilla PC
> >     Intel Pro/100 ethernet
> >     Slackware 9.1
> >     Linux kernel 2.6.5, plain with no mods or patches, only enough
> > 	drivers and features enabled to run my workstation
> > 	configuration as close as I could get to my Linux 2.4
> > 	kernel
> 
> This is pretty much covered in the NFS FAQ entry B10.
> 
> You are experiencing the classical effects of using unreliable transport
> (i.e. UDP) on a mixed speed network. Writes to the server are getting
> lost, because it is on a slow segment that cannot keep up with the
> faster 100Mbit clients.

But Charles was seeing good performance with 2.4-based clients.  When he
went to 2.6 everything fell apart.

Do we know why this regression occurred?

* Re: NFS and kernel 2.6.x
  2004-04-16  1:53   ` Andrew Morton
@ 2004-04-16  2:54     ` Trond Myklebust
  2004-04-16  4:59       ` Phil Oester
  2004-04-16  9:03     ` Jamie Lokier
  1 sibling, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16  2:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shannon, linux-kernel

On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> But Charles was seeing good performance with 2.4-based clients.  When he
> went to 2.6 everything fell apart.
> 
> Do we know why this regression occurred?

What regression??? You have a statistic of 1 person whose 3 clients
changed from what was an apparently working setup to what has *always*
been the usual scenario for most people that tried to use the same
broken hardware/software combination whether it be in 2.2.x, 2.4.x or
2.6.x. 

The whole problem is that UDP provides unreliable transport... It offers
NO guarantees that the packet will arrive at the destination.
If only 1 fragment out of the 22 that it takes to send a single
wsize=32k write request to the Sun server gets lost on the way, the
Sun's networking layer will ignore that entire packet, and so the whole
write has to time out and get resent.
Switches can usually cache a few fragments if the clients on the 100Mbit
network are sending requests at a rate that almost matches the 10Mbit
bandwidth that the Sun server supports, but if the network is swamped so
that the switch runs out of cache, then it will start to drop packets.

This is the whole reason why Sun set TCP to be their default mount
option when they changed their servers to use 32k read/write.

My biggest suspect for why this particular setup changed in 2.6.x would
therefore be the changes to the way in which writes are scheduled on the
wire. We cache them for longer, and so overall the bandwidth usage goes
down, but at the expense of more "burstiness" when the user closes the
file or does some other fsync()-like operation.

So in fact you have 2 possible workarounds:

  - Use the TCP mount option (by far the better option, since TCP *does*
provide reliable transport).
  - Keep UDP, but use the wsize mount option to explicitly override the
server's choice of write sizes. That works by reducing the number of
fragments per write, and so improving performance by reducing the amount
of data that need to be resent per fragment lost.

Cheers,
  Trond
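
The fragment arithmetic behind the two workarounds can be sketched as
follows. This is a back-of-the-envelope model, not anything taken from
the NFS code: it assumes a 1500-byte Ethernet MTU, a 20-byte IPv4
header, an 8-byte UDP header, an RPC overhead that defaults to zero, and
independent fragment losses at a made-up rate of 1%. Under these
assumptions a 32k write needs 23 fragments, close to the 22 quoted
above; the exact count depends on header sizes.

```python
import math

IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8    # bytes
MTU = 1500        # bytes, standard Ethernet (assumption)


def fragments_per_write(wsize, rpc_overhead=0, mtu=MTU):
    """Number of IP fragments needed to carry one UDP write request."""
    # Fragment payloads must be a multiple of 8 bytes (except the last).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    datagram = UDP_HEADER + rpc_overhead + wsize
    return math.ceil(datagram / per_fragment)


def delivery_probability(n_fragments, fragment_loss):
    """Chance that every fragment arrives, assuming independent losses."""
    return (1.0 - fragment_loss) ** n_fragments


for wsize in (32768, 8192, 4096):
    n = fragments_per_write(wsize)
    # 0.01 is an illustrative loss rate, not a measurement.
    print(wsize, n, round(delivery_probability(n, 0.01), 3))
```

Fewer fragments per request means a higher chance that the whole
datagram arrives, and less data to resend when it does not, which is
the effect the wsize workaround relies on.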

* Re: NFS and kernel 2.6.x
  2004-04-16  2:54     ` Trond Myklebust
@ 2004-04-16  4:59       ` Phil Oester
  2004-04-16  5:29         ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Phil Oester @ 2004-04-16  4:59 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel

Actually I can concur -- I recently migrated 100+ servers from 2.4.x
to 2.6.3, and simply could not use UDP mounts and achieve acceptable
performance. Further, I wasn't using 32K r/w as you posit, but was
using 8K (against a NetApp FWIW).

If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
perhaps this should be documented -- or the option should be deprecated.

Phil Oester

On Thu, Apr 15, 2004 at 07:54:08PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> > But Charles was seeing good performance with 2.4-based clients.  When he
> > went to 2.6 everything fell apart.
> > 
> > Do we know why this regression occurred?
> 
> What regression??? You have a statistic of 1 person whose 3 clients

* Re: NFS and kernel 2.6.x
  2004-04-16  4:59       ` Phil Oester
@ 2004-04-16  5:29         ` Trond Myklebust
  2004-04-16  7:13           ` Paul Wagland
                             ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16  5:29 UTC (permalink / raw)
  To: Phil Oester; +Cc: Andrew Morton, shannon, linux-kernel

On Thu, 15/04/2004 at 21:59, Phil Oester wrote:

> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> perhaps this should be documented -- or the option should be deprecated.

Put simply: I am not interested in wasting _my_ time investigating cases
where UDP is performing badly if TCP is working fine. The variable
reliability issues with UDP are precisely why we worked to get the TCP
stuff working efficiently.

As for blanket statements like the above: I have seen no evidence yet
that they are any more warranted in 2.6.x than they were in 2.4.x. At
least not as long as I continue to see wire speed performance on reads
and writes on UDP on all my own test setups.

Cheers,
  Trond

* Re: NFS and kernel 2.6.x
  2004-04-16  5:29         ` Trond Myklebust
@ 2004-04-16  7:13           ` Paul Wagland
  2004-04-16 14:44           ` Marcelo Tosatti
  2004-04-17 16:44           ` Matthias Urlichs
  2 siblings, 0 replies; 54+ messages in thread
From: Paul Wagland @ 2004-04-16  7:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: shannon, Phil Oester, Andrew Morton, linux-kernel


On Apr 16, 2004, at 7:29, Trond Myklebust wrote:

> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
>
>> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
>> perhaps this should be documented -- or the option should be deprecated.
>
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Just as an aside, I can confirm this as well... we use UDP mounts, and 
get a pretty constant 10MB/s (assuming people aren't running bloody 
xscreensavers!*!)

Cheers,
Paul

* Re: NFS and kernel 2.6.x
  2004-04-16  5:29         ` Trond Myklebust
  2004-04-16  7:13           ` Paul Wagland
@ 2004-04-16 14:44           ` Marcelo Tosatti
  2004-04-16 14:46             ` Marcelo Tosatti
                               ` (2 more replies)
  2004-04-17 16:44           ` Matthias Urlichs
  2 siblings, 3 replies; 54+ messages in thread
From: Marcelo Tosatti @ 2004-04-16 14:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
> 
> > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > perhaps this should be documented -- or the option should be deprecated.
> 
> Put simply: I am not interested in wasting _my_ time investigating cases
> where UDP is performing badly if TCP is working fine. The variable
> reliability issues with UDP are precisely why we worked to get the TCP
> stuff working efficiently.
> 
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Maybe TCP should be the default then? In case no one finds the reason
why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in 
theory?


* Re: NFS and kernel 2.6.x
  2004-04-16 14:44           ` Marcelo Tosatti
@ 2004-04-16 14:46             ` Marcelo Tosatti
  2004-04-16 15:50             ` Trond Myklebust
  2004-04-16 15:55             ` Dave Gilbert (Home)
  2 siblings, 0 replies; 54+ messages in thread
From: Marcelo Tosatti @ 2004-04-16 14:46 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Fri, Apr 16, 2004 at 11:44:33AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> > On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
> > 
> > > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > > perhaps this should be documented -- or the option should be deprecated.
> > 
> > Put simply: I am not interested in wasting _my_ time investigating cases
> > where UDP is performing badly if TCP is working fine. The variable
> > reliability issues with UDP are precisely why we worked to get the TCP
> > stuff working efficiently.
> > 
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x. At
> > least not as long as I continue to see wire speed performance on reads
> > and writes on UDP on all my own test setups.
> 
> Maybe TCP should be the default then?

Or just make a big warning in the Kconfig. Distros will
set it to the default...

> In case no one finds the reason 
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in 
> theory?


* Re: NFS and kernel 2.6.x
  2004-04-16 14:44           ` Marcelo Tosatti
  2004-04-16 14:46             ` Marcelo Tosatti
@ 2004-04-16 15:50             ` Trond Myklebust
  2004-04-16 15:55             ` Dave Gilbert (Home)
  2 siblings, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16 15:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Fri, 2004-04-16 at 07:44, Marcelo Tosatti wrote:
> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in 
> theory?

Are you talking about the TCP server configuration option here, or the
TCP mount option? IMO both should be default.

I've got a patch for the "mount" program, which I've been intending to
send on to Andries (I've just been too busy for the past few weeks to
give it a last review).

Cheers,
  Trond

* Re: NFS and kernel 2.6.x
  2004-04-16 14:44           ` Marcelo Tosatti
  2004-04-16 14:46             ` Marcelo Tosatti
  2004-04-16 15:50             ` Trond Myklebust
@ 2004-04-16 15:55             ` Dave Gilbert (Home)
  2004-04-16 16:13               ` Trond Myklebust
  2 siblings, 1 reply; 54+ messages in thread
From: Dave Gilbert (Home) @ 2004-04-16 15:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Trond Myklebust, linux-kernel, Andrew Morton, shannon,
	Phil Oester

Marcelo Tosatti wrote:

> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in 
> theory?

While it is reasonable to make TCP the default, it is important that a
real problem with UDP NFS, if there is one, gets sorted.  Some of us have to
work with older machines and kernels on clients that don't support TCP NFS.

Dave


* Re: NFS and kernel 2.6.x
  2004-04-16 15:55             ` Dave Gilbert (Home)
@ 2004-04-16 16:13               ` Trond Myklebust
  2004-04-16 19:07                 ` Daniel Egger
  2004-04-16 19:11                 ` Charles Shannon Hendrix
  0 siblings, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16 16:13 UTC (permalink / raw)
  To: Dave Gilbert (Home)
  Cc: Marcelo Tosatti, linux-kernel, Andrew Morton, shannon,
	Phil Oester

On Fri, 2004-04-16 at 08:55, Dave Gilbert (Home) wrote:
> While it is reasonable to make TCP the default, it is important that a
> real problem with UDP NFS, if there is one, gets sorted.  Some of us have to
> work with older machines and kernels on clients that don't support TCP NFS.

Then "some of you" can send in a proper bug report in the usual format if
and when that problem actually occurs.

So far I have NOTHING to tell me there is a problem here. Just a load of
people going ballistic over hot air....



* Re: NFS and kernel 2.6.x
  2004-04-16 16:13               ` Trond Myklebust
@ 2004-04-16 19:07                 ` Daniel Egger
  2004-04-17  4:56                   ` Chris Friesen
  2004-04-17  5:24                   ` Trond Myklebust
  2004-04-16 19:11                 ` Charles Shannon Hendrix
  1 sibling, 2 replies; 54+ messages in thread
From: Daniel Egger @ 2004-04-16 19:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

On 16.04.2004, at 18:13, Trond Myklebust wrote:

> Then "some of you" can send in a proper bug report in the usual format
> if and when that problem actually occurs.

> So far I have NOTHING to tell me there is a problem here. Just a load
> of people going ballistic over hot air....

Great you want to help here. So I've a system which is NFS root using a
3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
water somewhere in between 10 seconds and 5 minutes after boot using
NFS over UDP. The last thing I see are 3 or 4 messages of the type:

server 192.168.11.2 not responding, still trying

NFS seems to work better with 2.6.4 which unfortunately has other nasty
bugs for me; currently I'm running 2.4.26 which works fine, over both
UDP and TCP.

Preempt is off as are the NFS features which I do not trust yet (v4 and
direct IO). Attached is the config for your viewing pleasure.

Please tell me how I can help here and I'll certainly do it.

Servus,
       Daniel

[-- Attachment #1.2: config.gz --]
[-- Type: application/x-gzip, Size: 8138 bytes --]

* Re: NFS and kernel 2.6.x
  2004-04-16 19:07                 ` Daniel Egger
@ 2004-04-17  4:56                   ` Chris Friesen
  2004-04-17  9:56                     ` Daniel Egger
  2004-04-17  5:24                   ` Trond Myklebust
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Friesen @ 2004-04-17  4:56 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Trond Myklebust, Linux Kernel

Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

If this is an issue, it might make sense to have root be a tmpfs 
filesystem, and then have specific network mounts.  Note--don't make 
"/var/log" network mounted, various apps default to trying to check for 
files there--if the server goes away, you can't log in/out.

Chris

* Re: NFS and kernel 2.6.x
  2004-04-17  4:56                   ` Chris Friesen
@ 2004-04-17  9:56                     ` Daniel Egger
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Egger @ 2004-04-17  9:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Linux Kernel, Trond Myklebust

On 17.04.2004, at 06:56, Chris Friesen wrote:

>> Great you want to help here. So I've a system which is NFS root using a
>> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
>> water somewhere in between 10 seconds and 5 minutes after boot using
>> NFS over UDP. The last thing I see are 3 or 4 messages of the type:
>
> If this is an issue, it might make sense to have root be a tmpfs
> filesystem, and then have specific network mounts.

I'm trying to keep this a standard Debian system as much as possible.
Also I've several machines having a large number of shared partitions,
some of them fulfill different purposes, so I would need to customize
several instances which sounds like much work to me; part of it
certainly unnecessary because it works just fine with older kernels... :)

Also there is the issue that the only thing that is sort of guaranteed
to be transported over the network is the kernel itself. Sometimes it
hangs already when or just after loading init. I'm not convinced it will
be always able to transfer the whole ramdisk....

Forgot to mention: I've also seen segfaults and wrong file contents
in random places while init executes the scripts in /etc/rc*.d but
those seem to have gone away after I used a more conservative set
of kernel config options. Now it'll only hang.

>   Note--don't make "/var/log" network mounted, various apps default to 
> trying to check for files there--if the server goes away, you can't 
> log in/out.

There's unfortunately more to this. I also cannot log in if
any of the files (bash, bashrc, profiles, libraries, etc.)
needed for login are on nfs. The question here is what is more
reliable in terms of data transfer after an Oops: NFS or
syslogd (UDP). So far I'm satisfied with NFS here, so I don't
see a good reason to change.

Servus,
       Daniel

* Re: NFS and kernel 2.6.x
  2004-04-16 19:07                 ` Daniel Egger
  2004-04-17  4:56                   ` Chris Friesen
@ 2004-04-17  5:24                   ` Trond Myklebust
  2004-04-17 14:15                     ` Daniel Egger
  1 sibling, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17  5:24 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Linux Kernel

On Fri, 2004-04-16 at 12:07, Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

...and if you use TCP?

> server 192.168.11.2 not responding, still trying

The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
-w dump.out"...

Cheers,
  Trond

* Re: NFS and kernel 2.6.x
  2004-04-17  5:24                   ` Trond Myklebust
@ 2004-04-17 14:15                     ` Daniel Egger
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Egger @ 2004-04-17 14:15 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

On Sat, 2004-04-17 at 07:24, Trond Myklebust wrote:

> > Great you want to help here. So I've a system which is NFS root using a
> > 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> > water somewhere in between 10 seconds and 5 minutes after boot using
> > NFS over UDP. The last thing I see are 3 or 4 messages of the type:

> ...and if you use TCP?

My bad, I got confused; with TCP I get the hangs, with UDP the data
corruption. Unfortunately it doesn't want to hang for me right now.
:( ...

> > server 192.168.11.2 not responding, still trying

> The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
> -w dump.out"...

but I have two different tasty cases of data corruption using NFS over
UDP traced for you which I'll send you in private. The first one
corrupts init so that it segfaults, the second one probably crashes the
rc starter so that I'm left with an unusable getty login on console.

I'll try to get the TCP problems traced as well but right now I don't
have the time to wait....

-- 
Servus,
       Daniel

* Re: NFS and kernel 2.6.x
  2004-04-16 16:13               ` Trond Myklebust
  2004-04-16 19:07                 ` Daniel Egger
@ 2004-04-16 19:11                 ` Charles Shannon Hendrix
  1 sibling, 0 replies; 54+ messages in thread
From: Charles Shannon Hendrix @ 2004-04-16 19:11 UTC (permalink / raw)
  To: Linux Kernel

Fri, 16 Apr 2004 @ 09:13 -0700, Trond Myklebust said:

> Then "some of you" can send in a proper bug report in the usual format if
> and when that problem actually occurs.
> 
> So far I have NOTHING to tell me there is a problem here. Just a load of
> people going ballistic over hot air....

Several people are reporting a problem and discussing it, but I don't
see any of them going ballistic.



-- 
shannon "AT" widomaker.com -- ["Secrecy is the beginning of tyranny." --
Unknown]

* Re: NFS and kernel 2.6.x
  2004-04-16  5:29         ` Trond Myklebust
  2004-04-16  7:13           ` Paul Wagland
  2004-04-16 14:44           ` Marcelo Tosatti
@ 2004-04-17 16:44           ` Matthias Urlichs
  2004-04-17 18:15             ` Trond Myklebust
  2004-04-19  9:06             ` Helge Hafting
  2 siblings, 2 replies; 54+ messages in thread
From: Matthias Urlichs @ 2004-04-17 16:44 UTC (permalink / raw)
  To: linux-kernel

Hi, Trond Myklebust wrote:

> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x.

Oh, I saw the problem too: a slow client couldn't do full-size reads from
a fast server because the buffer on the client's network card was just 8k.

Granted that the client is a slow m68k Mac, but 2.4 was fast enough to get
the first packet entirely off the card before the last one overruns the
buffer -- while 2.6 has a bit more latency, so it can't.

Apparently that bit of increased latency is offset by the fact that the
machine still limps along if I packet-bomb it. Under 2.4 it locked solid,
so overall I think that the 2.6 situation is an improvement.

-- 
Matthias Urlichs
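
As a rough check of the buffer argument Matthias describes, the sketch
below adds up the frame bytes of one fragmented UDP read reply and
compares the total with an 8k card buffer. The header sizes, the
1500-byte MTU and the zero RPC overhead are assumptions; only the 8k
buffer size comes from the report, and the model ignores how fast the
host drains the card.

```python
import math

ETH_HEADER = 14   # bytes, Ethernet II header, FCS not counted (assumption)
IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8    # bytes
MTU = 1500        # bytes, standard Ethernet (assumption)
CARD_BUFFER = 8 * 1024   # "just 8k", per the report above


def bytes_on_card(read_size, rpc_overhead=0):
    """Total frame bytes for one fragmented UDP read reply."""
    datagram = UDP_HEADER + rpc_overhead + read_size
    per_fragment = (MTU - IP_HEADER) // 8 * 8
    n_fragments = math.ceil(datagram / per_fragment)
    # Every fragment carries its own IP and Ethernet header.
    return datagram + n_fragments * (IP_HEADER + ETH_HEADER)


for rsize in (8192, 4096, 1024):
    total = bytes_on_card(rsize)
    verdict = "fits" if total <= CARD_BUFFER else "overruns unless drained in time"
    print(rsize, total, verdict)
```

Under these assumptions an 8k read reply is already larger than the
buffer, so the client has to pull the first fragments off the card
before the last ones arrive; smaller read sizes avoid that race.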

* Re: NFS and kernel 2.6.x
  2004-04-17 16:44           ` Matthias Urlichs
@ 2004-04-17 18:15             ` Trond Myklebust
  2004-04-17 18:32               ` Marc Singer
  2004-04-19  9:06             ` Helge Hafting
  1 sibling, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 18:15 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: linux-kernel

On Sat, 2004-04-17 at 09:44, Matthias Urlichs wrote:
> Hi, Trond Myklebust wrote:
> 
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x.
> 
> Oh, I saw the problem too: a slow client couldn't do full-size reads from
> a fast server because the buffer on the client's network card was just 8k.

Right, and this has always been a problem. I had the same issues when
doing 8k reads on one of my 75MHz Pentiums some 10 years ago. The thing
would more or less lock up and just pump out a constant stream of "time
exceeded" ICMP messages.

The NFS/RPC layer knows nothing about the existence of network cards or
their buffer sizes. Only about sockets and how to read from/write to
them.
This sort of issue is precisely why I'd prefer to see people use TCP by
default. UDP with its dependency on fragmentation works fine on fast
setups with homogeneous lossless networks. It sucks as soon as you break
one of those conditions.

Cheers,
  Trond

* Re: NFS and kernel 2.6.x
  2004-04-17 18:15             ` Trond Myklebust
@ 2004-04-17 18:32               ` Marc Singer
  2004-04-17 18:58                 ` Trond Myklebust
  2004-04-17 19:01                 ` Daniel Egger
  0 siblings, 2 replies; 54+ messages in thread
From: Marc Singer @ 2004-04-17 18:32 UTC (permalink / raw)
  To: linux-kernel

On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> This sort of issue is precisely why I'd prefer to see people use TCP by
> default. UDP with its dependency on fragmentation works fine on fast
> setups with homogeneous lossless networks. It sucks as soon as you break
> one of those conditions.

I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
mount.  It looks like the support is there.  What activates it?
 

* Re: NFS and kernel 2.6.x
  2004-04-17 18:32               ` Marc Singer
@ 2004-04-17 18:58                 ` Trond Myklebust
  2004-04-17 19:01                   ` Marc Singer
  2004-04-17 22:22                   ` Marc Singer
  2004-04-17 19:01                 ` Daniel Egger
  1 sibling, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 18:58 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel

On Sat, 2004-04-17 at 11:32, Marc Singer wrote:
> On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> > This sort of issue is precisely why I'd prefer to see people use TCP by
> > default. UDP with its dependency on fragmentation works fine on fast
> > setups with homogeneous lossless networks. It sucks as soon as you break
> > one of those conditions.
> 
> I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
> mount.  It looks like the support is there.  What activates it?

It's all there. Just use the "tcp" mount option.

Cheers,
  Trond

* Re: NFS and kernel 2.6.x
  2004-04-17 18:58                 ` Trond Myklebust
@ 2004-04-17 19:01                   ` Marc Singer
  2004-04-17 19:09                     ` Trond Myklebust
  2004-04-17 22:22                   ` Marc Singer
  1 sibling, 1 reply; 54+ messages in thread
From: Marc Singer @ 2004-04-17 19:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote:
> > I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
> > mount.  It looks like the support is there.  What activates it?
> 
> It's all there. Just use the "tcp" mount option.

I think you are talking about the fstab mount option.  Is there a
kernel command line option for this?  That's what I've been looking
for.  I'm not using an initrd.

Cheers.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 19:01                   ` Marc Singer
@ 2004-04-17 19:09                     ` Trond Myklebust
  2004-04-17 19:19                       ` Russell King
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 19:09 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel

On Sat, 2004-04-17 at 12:01, Marc Singer wrote:

> I think you are talking about the fstab mount option.  Is there a
> kernel command line option for this?  That's what I've been looking
> for.  I'm not using an initrd.

No. I'm talking about the built-in parser to enable NFSROOT to pass
mount options. As in:

   nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]

See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
and your root partition will use TCP instead of UDP.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 19:09                     ` Trond Myklebust
@ 2004-04-17 19:19                       ` Russell King
  2004-04-18  2:51                         ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Russell King @ 2004-04-17 19:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 12:09:24PM -0700, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 12:01, Marc Singer wrote:
> 
> > I think you are talking about the fstab mount option.  Is there a
> > kernel command line option for this?  That's what I've been looking
> > for.  I'm not using an initrd.
> 
> No. I'm talking about the built-in parser to enable NFSROOT to pass
> mount options. As in:
> 
>    nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
> 
> See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
> and your root partition will use TCP instead of UDP.

Trond,

Can you explain how this works?

static int __init root_nfs_parse(char *name, char *buf)
{
...
        while ((p = strsep (&name, ",")) != NULL) {
                int token;
                if (!*p)
                        continue;
                token = match_token(p, tokens, args);

                /* %u tokens only */
                if (match_int(&args[0], &option))
                        return 0;

Firstly, as far as I can see, args[] is uninitialised.  If match_token
doesn't touch args[] then we pass match_int some uninitialised kernel
memory.

Secondly, we seem to exit if match_int doesn't parse a number.  Not
all options in "tokens" have a number associated with them, including
ones like "tcp".

So, given that "tcp" is the only option, I think we'll end up passing
match_int() some uninitialised memory which may cause a kernel oops.
If not, it probably won't be a valid number, so we'll ignore the option.

However, it will appear to work as long as the first option has a
number associated with it (ie, is one of the first 9 options.)
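
To make the second point concrete, here is a minimal user-space sketch of
the same control flow.  It is not the kernel code: classify() and
parse_int() are hypothetical stand-ins for match_token() and match_int(),
only two options are modelled, and the argument pointer is always
initialised, so only the "option is rejected" outcome is reproduced, not
the uninitialised read.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's option tokens. */
enum { OPT_RSIZE, OPT_TCP, OPT_ERR };

/* Classify one option; for "rsize=N" point *arg at the digits. */
static int classify(const char *opt, const char **arg)
{
    *arg = NULL;
    if (strncmp(opt, "rsize=", 6) == 0) {
        *arg = opt + 6;
        return OPT_RSIZE;
    }
    if (strcmp(opt, "tcp") == 0)
        return OPT_TCP;
    return OPT_ERR;
}

/* Returns 0 on success, nonzero if arg is not a decimal integer. */
static int parse_int(const char *arg, int *out)
{
    char *end;
    long v;

    if (arg == NULL || *arg == '\0')
        return 1;
    v = strtol(arg, &end, 10);
    if (*end != '\0')
        return 1;
    *out = (int)v;
    return 0;
}

/*
 * Returns 1 if the option is accepted, 0 if it is rejected.
 * guard == 0 mimics the loop quoted above: the integer parse runs for
 * every token.  guard != 0 runs it only for integer-valued tokens.
 */
static int accept_option(const char *opt, int guard)
{
    const char *arg;
    int value = 0;
    int token = classify(opt, &arg);

    if ((!guard || token == OPT_RSIZE) && parse_int(arg, &value))
        return 0;
    return token != OPT_ERR;
}
```

In this sketch accept_option("tcp", 0) rejects the option while
accept_option("rsize=4096", 0) accepts it; with the guard enabled both
are accepted.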

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 19:19                       ` Russell King
@ 2004-04-18  2:51                         ` Trond Myklebust
  2004-04-19 16:39                           ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-18  2:51 UTC (permalink / raw)
  To: Russell King; +Cc: Marc Singer, linux-kernel

On Sat, 2004-04-17 at 12:19, Russell King wrote:

> Firstly, as far as I can see, args[] is uninitialised.  If match_token
> doesn't touch args[] then we pass match_int some uninitialised kernel
> memory.
>
> Secondly, we seem to exit if match_int doesn't parse a number.  Not
> all options in "tokens" have a number associated with them, including
> ones like "tcp".

Agreed. The correct fix should be something like the appended patch. It
depends on all tokens that do take an integer argument being listed
first in the enum.

Comments?

Cheers,
  Trond
 nfsroot.c |   17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

--- linux-2.6.6-up/fs/nfs/nfsroot.c.orig	2004-04-17 11:05:10.000000000 -0700
+++ linux-2.6.6-up/fs/nfs/nfsroot.c	2004-04-17 18:47:05.000000000 -0700
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0;		/
  ***************************************************************************/
 
 enum {
+	/* Options that take integer arguments */
 	Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
-	Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+	Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+	/* Options that take no arguments */
+	Opt_soft, Opt_hard, Opt_intr,
 	Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, 
 	Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
-	Opt_broken_suid, Opt_err,
+	Opt_broken_suid,
+	/* Error token */
+	Opt_err
 };
 
 static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
 	{Opt_noac, "noac"},
 	{Opt_lock, "lock"},
 	{Opt_nolock, "nolock"},
+	{Opt_v2, "nfsvers=2"},
 	{Opt_v2, "v2"},
+	{Opt_v3, "nfsvers=3"},
 	{Opt_v3, "v3"},
+	{Opt_udp, "proto=udp"},
 	{Opt_udp, "udp"},
+	{Opt_tcp, "proto=tcp"},
 	{Opt_tcp, "tcp"},
 	{Opt_broken_suid, "broken_suid"},
 	{Opt_err, NULL}
@@ -179,8 +188,8 @@ static int __init root_nfs_parse(char *n
 			continue;
 		token = match_token(p, tokens, args);
 
-		/* %u tokens only */
-		if (match_int(&args[0], &option))
+		/* %u tokens only. Beware if you add new tokens! */
+		if (token < Opt_soft && match_int(&args[0], &option))
 			return 0;
 		switch (token) {
 			case Opt_port:


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-18  2:51                         ` Trond Myklebust
@ 2004-04-19 16:39                           ` Trond Myklebust
  2004-04-19 21:10                             ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 16:39 UTC (permalink / raw)
  To: Russell King; +Cc: Marc Singer, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

On Sat, 2004-04-17 at 22:51, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 12:19, Russell King wrote:
> 
> > Firstly, as far as I can see, args[] is uninitialised.  If match_token
> > doesn't touch args[] then we pass match_int some uninitialised kernel
> > memory.
> >
> > Secondly, we seem to exit if match_int doesn't parse a number.  Not
> > all options in "tokens" have a number associated with them, including
> > ones like "tcp".
> 
> Agreed. The correct fix should be something like the appended patch. It
> depends on all tokens that do take an integer argument being listed
> first in the enum.
> 
> Comments?

It turned out there were a few extra issues that weren't fixed by the
previous patch. Thanks to boris@macbeth.rhoen.de for helping debug them.

Hopefully this will be the final set of fixes.

Cheers,
  Trond



[-- Attachment #2: Type: text/plain, Size: 2951 bytes --]

 nfsroot.c |   33 +++++++++++++++++++++------------
 1 files changed, 21 insertions(+), 12 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c
--- linux-2.6.6-01-soft/fs/nfs/nfsroot.c	2004-04-17 23:01:09.000000000 -0400
+++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c	2004-04-19 12:08:31.000000000 -0400
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0;		/
  ***************************************************************************/
 
 enum {
+	/* Options that take integer arguments */
 	Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
-	Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+	Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+	/* Options that take no arguments */
+	Opt_soft, Opt_hard, Opt_intr,
 	Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, 
 	Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
-	Opt_broken_suid, Opt_err,
+	Opt_broken_suid,
+	/* Error token */
+	Opt_err
 };
 
 static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
 	{Opt_noac, "noac"},
 	{Opt_lock, "lock"},
 	{Opt_nolock, "nolock"},
+	{Opt_v2, "nfsvers=2"},
 	{Opt_v2, "v2"},
+	{Opt_v3, "nfsvers=3"},
 	{Opt_v3, "v3"},
+	{Opt_udp, "proto=udp"},
 	{Opt_udp, "udp"},
+	{Opt_tcp, "proto=tcp"},
 	{Opt_tcp, "tcp"},
 	{Opt_broken_suid, "broken_suid"},
 	{Opt_err, NULL}
@@ -162,25 +171,21 @@ static match_table_t tokens = {
 static int __init root_nfs_parse(char *name, char *buf)
 {
 
-	char *p;
+	char *p, *path = name;
 	substring_t args[MAX_OPT_ARGS];
 	int option;
 
 	if (!name)
 		return 1;
 
-	if (name[0] && strcmp(name, "default")){
-		strlcpy(buf, name, NFS_MAXPATHLEN);
-		return 1;
-	}
 	while ((p = strsep (&name, ",")) != NULL) {
 		int token; 
 		if (!*p)
 			continue;
 		token = match_token(p, tokens, args);
 
-		/* %u tokens only */
-		if (match_int(&args[0], &option))
+		/* %u tokens only. Beware if you add new tokens! */
+		if (token < Opt_soft && match_int(&args[0], &option))
 			return 0;
 		switch (token) {
 			case Opt_port:
@@ -265,6 +270,13 @@ static int __init root_nfs_parse(char *n
 				return 0;
 		}
 	}
+
+	/*
+	 * Copy the NFS remote path to the output buffer.
+	 * Relies on strsep() having converted the delimiting ',' to '\0'.
+	 */
+	if (path[0] != '\0' && strcmp(path, "default") != 0)
+		strlcpy(buf, path, NFS_MAXPATHLEN);
 	return 1;
 }
 
@@ -283,9 +295,6 @@ static int __init root_nfs_name(char *na
 	nfs_data.flags    = NFS_MOUNT_NONLM;	/* No lockd in nfs root yet */
 	nfs_data.rsize    = NFS_DEF_FILE_IO_BUFFER_SIZE;
 	nfs_data.wsize    = NFS_DEF_FILE_IO_BUFFER_SIZE;
-	nfs_data.bsize	  = 0;
-	nfs_data.timeo    = 7;
-	nfs_data.retrans  = 3;
 	nfs_data.acregmin = 3;
 	nfs_data.acregmax = 60;
 	nfs_data.acdirmin = 30;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-19 16:39                           ` Trond Myklebust
@ 2004-04-19 21:10                             ` Trond Myklebust
  0 siblings, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 21:10 UTC (permalink / raw)
  To: Russell King; +Cc: Marc Singer, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 417 bytes --]

On Mon, 2004-04-19 at 12:39, Trond Myklebust wrote:

> It turned out there were a few extra issues that weren't fixed by the
> previous patch. Thanks to boris@macbeth.rhoen.de for helping debug them.
> 
> Hopefully this will be the final set of fixes.

Sigh. It wasn't... The remote path was still not getting set properly.
Here's the final version. Tested, and it should now work according to
spec!
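
The intended path handling can be sketched in user space roughly as
follows.  This is an illustration under assumptions, not the patch
itself: split_nfsroot() is a hypothetical helper, strchr() replaces
strsep(), and strncpy() replaces the kernel's strlcpy().

```c
#include <string.h>

/*
 * Split "path[,opt,...]" at the first comma.  Copies the path into buf
 * unless it is empty or "default", in which case buf is left untouched.
 * Returns a pointer to the option list, or NULL if there are no options.
 * Modifies name in place.
 */
static char *split_nfsroot(char *name, char *buf, size_t buflen)
{
    char *comma = strchr(name, ',');

    if (comma != NULL)
        *comma++ = '\0';
    if (name[0] != '\0' && strcmp(name, "default") != 0) {
        strncpy(buf, name, buflen - 1);
        buf[buflen - 1] = '\0';
    }
    return comma;
}
```

With "/tftpboot/foo,tcp,v3" the buffer receives "/tftpboot/foo" and the
returned option list is "tcp,v3"; with ",tcp,v3" the buffer keeps
whatever path was already in it and the options are still returned.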

Cheers,
  Trond

[-- Attachment #2: Type: text/plain, Size: 2718 bytes --]

 nfsroot.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c
--- linux-2.6.6-01-soft/fs/nfs/nfsroot.c	2004-04-19 12:27:51.000000000 -0400
+++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c	2004-04-19 16:26:12.000000000 -0400
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0;		/
  ***************************************************************************/
 
 enum {
+	/* Options that take integer arguments */
 	Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
-	Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+	Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+	/* Options that take no arguments */
+	Opt_soft, Opt_hard, Opt_intr,
 	Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, 
 	Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
-	Opt_broken_suid, Opt_err,
+	Opt_broken_suid,
+	/* Error token */
+	Opt_err
 };
 
 static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
 	{Opt_noac, "noac"},
 	{Opt_lock, "lock"},
 	{Opt_nolock, "nolock"},
+	{Opt_v2, "nfsvers=2"},
 	{Opt_v2, "v2"},
+	{Opt_v3, "nfsvers=3"},
 	{Opt_v3, "v3"},
+	{Opt_udp, "proto=udp"},
 	{Opt_udp, "udp"},
+	{Opt_tcp, "proto=tcp"},
 	{Opt_tcp, "tcp"},
 	{Opt_broken_suid, "broken_suid"},
 	{Opt_err, NULL}
@@ -169,18 +178,19 @@ static int __init root_nfs_parse(char *n
 	if (!name)
 		return 1;
 
-	if (name[0] && strcmp(name, "default")){
-		strlcpy(buf, name, NFS_MAXPATHLEN);
-		return 1;
-	}
+	/* Set the NFS remote path */
+	p = strsep(&name, ",");
+	if (p[0] != '\0' && strcmp(p, "default") != 0)
+		strlcpy(buf, p, NFS_MAXPATHLEN);
+
 	while ((p = strsep (&name, ",")) != NULL) {
 		int token; 
 		if (!*p)
 			continue;
 		token = match_token(p, tokens, args);
 
-		/* %u tokens only */
-		if (match_int(&args[0], &option))
+		/* %u tokens only. Beware if you add new tokens! */
+		if (token < Opt_soft && match_int(&args[0], &option))
 			return 0;
 		switch (token) {
 			case Opt_port:
@@ -265,6 +275,7 @@ static int __init root_nfs_parse(char *n
 				return 0;
 		}
 	}
+
 	return 1;
 }
 
@@ -283,9 +294,6 @@ static int __init root_nfs_name(char *na
 	nfs_data.flags    = NFS_MOUNT_NONLM;	/* No lockd in nfs root yet */
 	nfs_data.rsize    = NFS_DEF_FILE_IO_BUFFER_SIZE;
 	nfs_data.wsize    = NFS_DEF_FILE_IO_BUFFER_SIZE;
-	nfs_data.bsize	  = 0;
-	nfs_data.timeo    = 7;
-	nfs_data.retrans  = 3;
 	nfs_data.acregmin = 3;
 	nfs_data.acregmax = 60;
 	nfs_data.acdirmin = 30;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 18:58                 ` Trond Myklebust
  2004-04-17 19:01                   ` Marc Singer
@ 2004-04-17 22:22                   ` Marc Singer
  2004-04-18  0:57                     ` Trond Myklebust
  1 sibling, 1 reply; 54+ messages in thread
From: Marc Singer @ 2004-04-17 22:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote:
> > I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
> > mount.  It looks like the support is there.  What activates it?
> 
> It's all there. Just use the "tcp" mount option.
> 

I have a data point for comparison.

I'm copying a 40MiB file over NFS.  In five trials, the mean transfer
times are

  UDP (v2):  48.5s
  TCP (v3):  52.7s
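
For scale, those times work out to roughly 6.9 Mbit/s for UDP and
6.4 Mbit/s for TCP.  The arithmetic, as a small sketch (mbit_per_sec()
is a hypothetical helper; it takes 1 MiB as 2^20 bytes and 1 Mbit as
10^6 bits):

```c
/* Throughput in Mbit/s for `mib` MiB transferred in `seconds` seconds. */
static double mbit_per_sec(double mib, double seconds)
{
    return mib * 1024.0 * 1024.0 * 8.0 / seconds / 1e6;
}
```

Call it as mbit_per_sec(40.0, 48.5) for the UDP figure and
mbit_per_sec(40.0, 52.7) for the TCP one.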


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 22:22                   ` Marc Singer
@ 2004-04-18  0:57                     ` Trond Myklebust
  2004-04-18  5:01                       ` Marc Singer
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-18  0:57 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel

On Sat, 2004-04-17 at 15:22, Marc Singer wrote:
> I have a data point for comparison.
> 
> I'm copying a 40MiB file over NFS.  In five trials, the mean transfer
> times are
> 
>   UDP (v2):  48.5s
>   TCP (v3):  52.7s

Against what kind of server on what kind of network, with what kind of
mount options?
The above would be quite reasonable performance on a 10Mbit network
against a filer or a Linux server with the (insecure) "async" option
set.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-18  0:57                     ` Trond Myklebust
@ 2004-04-18  5:01                       ` Marc Singer
  2004-04-18  6:36                         ` Chris Friesen
  0 siblings, 1 reply; 54+ messages in thread
From: Marc Singer @ 2004-04-18  5:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 05:57:46PM -0700, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 15:22, Marc Singer wrote:
> > I have a data point for comparison.
> > 
> > I'm copying a 40MiB file over NFS.  In five trials, the mean transfer
> > times are
> > 
> >   UDP (v2):  48.5s
> >   TCP (v3):  52.7s
> 
> Against what kind of server on what kind of network, with what kind of
> mount options?
> The above would be quite reasonable performance on a 10Mbit network
> against a filer or a Linux server with the (insecure) "async" option
> set.

Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
kernel nfs daemon; network is 100Mib.  There is nothing else on the
network except intermittent broadband traffic.  Async is set on the
server side.

While I have seen much worse performance in the last couple of weeks,
I cannot blame NFS when I look at the numbers.
 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-18  5:01                       ` Marc Singer
@ 2004-04-18  6:36                         ` Chris Friesen
  2004-04-18  7:56                           ` Russell King
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Friesen @ 2004-04-18  6:36 UTC (permalink / raw)
  To: Marc Singer; +Cc: Trond Myklebust, linux-kernel

Marc Singer wrote:

> Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> kernel nfs daemon; network is 100Mib.  There is nothing else on the
> network except intermittent broadband traffic.  Async is set on the
> server side.

Is the ARM that slow?  Under 2MB/s seems odd to me...but then maybe I'm
used to faster machines.

Chris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-18  6:36                         ` Chris Friesen
@ 2004-04-18  7:56                           ` Russell King
  2004-04-18 17:31                             ` Marc Singer
  0 siblings, 1 reply; 54+ messages in thread
From: Russell King @ 2004-04-18  7:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Marc Singer, Trond Myklebust, linux-kernel

On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote:
> Marc Singer wrote:
> 
> > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> > kernel nfs daemon; network is 100Mib.  There is nothing else on the
> > network except intermittent broadband traffic.  Async is set on the
> > server side.
> 
> Is the ARM that slow?  Under 2MB/s seems odd to me...but then maybe I'm
> used to faster machines.

It's probably the SMC91c111 ether chip causing all the problem - it's
only able to store about 4 packets before it starts dropping, which
isn't that much on a 100mbit network.

Running with rsize=4096 works wonders with this chip.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-18  7:56                           ` Russell King
@ 2004-04-18 17:31                             ` Marc Singer
  0 siblings, 0 replies; 54+ messages in thread
From: Marc Singer @ 2004-04-18 17:31 UTC (permalink / raw)
  To: Chris Friesen, Marc Singer, Trond Myklebust, linux-kernel

On Sun, Apr 18, 2004 at 08:56:19AM +0100, Russell King wrote:
> On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote:
> > Marc Singer wrote:
> > 
> > > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> > > kernel nfs daemon; network is 100Mib.  There is nothing else on the
> > > network except intermittent broadband traffic.  Async is set on the
> > > server side.
> > 
> > Is the ARM that slow?  Under 2MB/s seems odd to me...but then maybe I'm
> > used to faster machines.
> 
> It's probably the SMC91c111 ether chip causing all the problem - it's
> only able to store about 4 packets before it starts dropping, which
> isn't that much on a 100mbit network.

I suspect that it might be a CPU issue.  On transmit only, it never
gets above 18Mib.

> Running with rsize=4096 works wonders with this chip.

Already there.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 18:32               ` Marc Singer
  2004-04-17 18:58                 ` Trond Myklebust
@ 2004-04-17 19:01                 ` Daniel Egger
  2004-04-17 20:22                   ` Marc Singer
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Egger @ 2004-04-17 19:01 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 433 bytes --]

On 17.04.2004, at 20:32, Marc Singer wrote:

> I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
> mount.  It looks like the support is there.  What activates it?

You need to add at least tcp as parameter to the nfsroot boot option,
like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 .

And, of course, if you mount/remount NFS partitions you also need to
specify the tcp parameter in your fstab.

Servus,
       Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 478 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 19:01                 ` Daniel Egger
@ 2004-04-17 20:22                   ` Marc Singer
  2004-04-18 11:14                     ` Daniel Egger
  0 siblings, 1 reply; 54+ messages in thread
From: Marc Singer @ 2004-04-17 20:22 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 09:01:38PM +0200, Daniel Egger wrote:
> On 17.04.2004, at 20:32, Marc Singer wrote:
> 
> >I'd be glad to compare TCP to UDP on my system.  It's using an nfsroot
> >mount.  It looks like the support is there.  What activates it?
> 
> You need to add at least tcp as parameter to the nfsroot boot option,
> like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 .

What I'd like to do is use a command line like this

  root=/dev/nfs ip=rarp nfsroot=,tcp,v3

But, it doesn't work.  I'd like to let the kernel autoconfiguration
handle the addressing.

> And, of course, if you mount/remount NFS partitions you also need to
> specify the tcp parameter in your fstab.
> 
> Servus,
>       Daniel



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 20:22                   ` Marc Singer
@ 2004-04-18 11:14                     ` Daniel Egger
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Egger @ 2004-04-18 11:14 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 665 bytes --]

On 17.04.2004, at 22:22, Marc Singer wrote:

> What I'd like to do is use a command line like this
>
>   root=/dev/nfs ip=rarp nfsroot=,tcp,v3
>
> But, it doesn't work.  I'd like to let the kernel autoconfiguration
> handle the addressing.

According to Documentation/nfsroot.txt you should be able
to do:

root=/dev/nfs ip=rarp nfsroot=/kernel,tcp,v3

i.e. the IP is optional. Just out of curiosity: how would you
supply the kernel name using rarp/bootp/dhcp? For a few days now
I've been using pxelinux, but before that I needed to hardcode the
path into the tagged image. Actually I prefer this to restarting
the dhcp server, but...

Servus,
       Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 478 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 16:44           ` Matthias Urlichs
  2004-04-17 18:15             ` Trond Myklebust
@ 2004-04-19  9:06             ` Helge Hafting
  1 sibling, 0 replies; 54+ messages in thread
From: Helge Hafting @ 2004-04-19  9:06 UTC (permalink / raw)
  To: Matthias Urlichs, linux-kernel

Matthias Urlichs wrote:
> Hi, Trond Myklebust wrote:
> 
> 
>>As for blanket statements like the above: I have seen no evidence yet
>>that they are any more warranted in 2.6.x than they were in 2.4.x.
> 
> 
> Oh, I saw the problem too: a slow client couldn't do full-size reads from
> a fast server because the buffer on the client's network card was just 8k.
> 
You can force nfs to use smaller packets, useful for those who
have to use udp because the server doesn't support nfs over tcp.
Try 8k, or even 4k.

Helge Hafting


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16  1:53   ` Andrew Morton
  2004-04-16  2:54     ` Trond Myklebust
@ 2004-04-16  9:03     ` Jamie Lokier
  2004-04-16 15:55       ` Trond Myklebust
  1 sibling, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-16  9:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, shannon, linux-kernel

Andrew Morton wrote:
> > On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> But Charles was seeing good performance with 2.4-based clients.  When he
> went to 2.6 everything fell apart.

Perhaps because 2.6 changes the UDP retransmit model for NFS, to
estimate the round-trip time and thus retransmit faster than 2.4
would.  Sometimes _much_ faster: I observed retransmits within a few
hundred microseconds.

On networks with a lot of latency variance, i.e. anything with big
queues, that would increase congestion.  That'd increase losses, and
because NFS over UDP uses large fragmented IP frames (TCP doesn't),
fragment loss will greatly increase IP frame loss, as Trond explained.

That's my hypothesis.
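
A rough sketch of why fragment loss hurts so much.  The assumptions are
mine: a 20-byte IPv4 header with no options, a 1500-byte MTU in the
example, and each fragment dropped independently with probability p.
Both helper names are hypothetical.

```c
/*
 * Number of IP fragments needed for one UDP datagram of udp_len bytes
 * (UDP header plus payload) on a link with the given MTU.
 */
static unsigned fragments(unsigned udp_len, unsigned mtu)
{
    /* Data carried per fragment must be a multiple of 8 bytes. */
    unsigned per_frag = (mtu - 20u) & ~7u;

    return (udp_len + per_frag - 1) / per_frag;
}

/*
 * Probability that the whole datagram is lost: it survives only if
 * every one of its fragments arrives.
 */
static double datagram_loss(unsigned nfrags, double p)
{
    double survive = 1.0;

    for (unsigned i = 0; i < nfrags; i++)
        survive *= 1.0 - p;
    return 1.0 - survive;
}
```

Ignoring RPC overhead, 8 KiB of data plus the 8-byte UDP header needs
fragments(8200, 1500) = 6 fragments, so a 1% fragment loss rate becomes
a datagram loss rate of nearly 6%.  A 4 KiB transfer needs only 3.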

There was also a problem with late 2.5 clients and "soft" NFS mounts.
Requests would timeout after a fixed number of retransmits, which on a
LAN could be after a few milliseconds due to round-trip estimation and
fast server response.  Then when an I/O on the server took longer,
e.g. due to a disk seek or contention, the client would timeout and
abort requests.  2.4 doesn't have this problem with "soft" due to the
longer, fixed retransmit timeout.  I don't know if it is fixed in
current 2.6 kernels - but you can avoid it by not using "soft" anyway.

-- Jamie

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16  9:03     ` Jamie Lokier
@ 2004-04-16 15:55       ` Trond Myklebust
  2004-04-16 18:48         ` Jamie Lokier
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16 15:55 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, shannon, linux-kernel

On Fri, 2004-04-16 at 02:03, Jamie Lokier wrote:

> Perhaps because 2.6 changes the UDP retransmit model for NFS, to
> estimate the round-trip time and thus retransmit faster than 2.4
> would.  Sometimes _much_ faster: I observed retransmits within a few
> hundred microseconds.

Retransmits within a few hundred microseconds should no longer be occurring.
Have you redone those measurements with a more recent kernel?
2.6.x and 2.4.x should have pretty much the same code for RTO
estimation.

In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one
difference is that 2.6.x uses zero copy writes.


> There was also a problem with late 2.5 clients and "soft" NFS mounts.
> Requests would timeout after a fixed number of retransmits, which on a
> LAN could be after a few milliseconds due to round-trip estimation and
> fast server response.  Then when an I/O on the server took longer,
> e.g. due to a disk seek or contention, the client would timeout and
> abort requests.  2.4 doesn't have this problem with "soft" due to the
> longer, fixed retransmit timeout.  I don't know if it is fixed in
> current 2.6 kernels - but you can avoid it by not using "soft" anyway.

Or changing the default value of "retrans" to something more sane. As
usual, Linux has a default that is lower than on any other platform.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16 15:55       ` Trond Myklebust
@ 2004-04-16 18:48         ` Jamie Lokier
  2004-04-16 19:06           ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-16 18:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel

Trond Myklebust wrote:
> > Perhaps because 2.6 changes the UDP retransmit model for NFS, to
> > estimate the round-trip time and thus retransmit faster than 2.4
> > would.  Sometimes _much_ faster: I observed retransmits within a few
> > hundred microseconds.
> 
> Retransmits within a few hundred microseconds should no longer be occurring.
> Have you redone those measurements with a more recent kernel?

No, not since I sent you the packet trace from a 2.5 kernel that
wasn't working with "soft".  I took your advice and stopped using
"soft".  It causes the obvious problem when I (rarely) turn off the
server, otherwise it's been fine and I'm using 2.6.5 now, still fine
(with "soft" not being used).

> 2.6.x and 2.4.x should have pretty much the same code for RTO
> estimation.
> 
> In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one
> difference is that 2.6.x uses zero copy writes.
> 
> > There was also a problem with late 2.5 clients and "soft" NFS mounts.
> > Requests would timeout after a fixed number of retransmits, which on a
> > LAN could be after a few milliseconds due to round-trip estimation and
> > fast server response.  Then when an I/O on the server took longer,
> > e.g. due to a disk seek or contention, the client would timeout and
> > abort requests.  2.4 doesn't have this problem with "soft" due to the
> > longer, fixed retransmit timeout.  I don't know if it is fixed in
> > current 2.6 kernels - but you can avoid it by not using "soft" anyway.
> 
> Or changing the default value of "retrans" to something more sane. As
> usual, Linux has a default that is lower than on any other platform.

If few-100-microsecond retransmits no longer occur, perhaps it's no
longer relevant.

The problem I saw with "soft" was that the retransmit time was quite a
good estimate of the server response time.  That part was fine, nice
even.  But then the server response latency would increase by a factor
of 10000 (ten thousand) due to normal disk I/O activity (compare cache
response with disk response on a busy disk), and of course 3
retransmits doubling each time is not adequate to cover that.  2.4 was
fine because the default rtt and retrans together could never get
shorter than a few seconds.

That's why I felt that iff rtt was adapting to the server response
time, then a fixed number of retransmits was no longer appropriate: a
lower bound on the time before timing out is appropriate, e.g. 3
seconds or 10 seconds or whatever.

In other words, with adaptive rtt the concept of "retrans" being a
fixed number is fundamentally flawed -- unless it's also accompanied
by a minimum timeout time.  You'd need a retrans value of 20 or so for
the above perfectly normal LAN situation, but then that's far too
large on other occasions with other networks or servers.
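
To put numbers on that, here is a small sketch of the give-up time under
a simplified model: one initial wait plus "retrans" further waits, each
double the previous one.  That model is an assumption for illustration,
not the kernel's exact algorithm, and soft_give_up_us() is a
hypothetical name.

```c
/*
 * Total time, in microseconds, spent waiting before a soft mount gives
 * up: the initial wait plus `retrans` retransmit waits, doubling each
 * time.
 */
static unsigned long long soft_give_up_us(unsigned long long rto_us,
                                          unsigned retrans)
{
    unsigned long long total = 0;

    for (unsigned i = 0; i <= retrans; i++) {
        total += rto_us;
        rto_us *= 2;
    }
    return total;
}
```

With an adapted estimate of 300 microseconds and retrans=3, the request
is abandoned after 300 + 600 + 1200 + 2400 = 4500 microseconds, far
short of a multi-second disk stall.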

-- Jamie

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16 18:48         ` Jamie Lokier
@ 2004-04-16 19:06           ` Trond Myklebust
  2004-04-16 19:39             ` Jamie Lokier
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-16 19:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, shannon, linux-kernel

On Fri, 2004-04-16 at 11:48, Jamie Lokier wrote:

> In other words, with adaptive rtt the concept of "retrans" being a
> fixed number is fundamentally flawed -- unless it's also accompanied
> by a minimum timeout time.  You'd need a retrans value of 20 or so for
> the above perfectly normal LAN situation, but then that's far too
> large on other occasions with other networks or servers.

At that point, it makes sense to drop the entire "retrans+timeo"
paradigm, and just state that soft timeouts take a single parameter
("timeo") that determines the timeout value.

That's something that is dead easy to do...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-16 19:06           ` Trond Myklebust
@ 2004-04-16 19:39             ` Jamie Lokier
  2004-04-17 22:32               ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-16 19:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel

Trond Myklebust wrote:
> > In other words, with adaptive rtt the concept of "retrans" being a
> > fixed number is fundamentally flawed -- unless it's also accompanied
> > by a minimum timeout time.  You'd need a retrans value of 20 or so for
> > the above perfectly normal LAN situation, but then that's far too
> > large on other occasions with other networks or servers.
> 
> At that point, it makes sense to drop the entire "retrans+timeo"
> paradigm, and just state that soft timeouts take a single parameter
> ("timeo") that determines the timeout value.

I agree.  30 seconds seems like a good default.

> That's something that is dead easy to do...

I'll test a patch for 2.6.5 if you provide one.

-- Jamie


* Re: NFS and kernel 2.6.x
  2004-04-16 19:39             ` Jamie Lokier
@ 2004-04-17 22:32               ` Trond Myklebust
  2004-04-18  3:26                 ` Jamie Lokier
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 22:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 752 bytes --]

On Fri, 2004-04-16 at 12:39, Jamie Lokier wrote:

> > That's something that is dead easy to do...
> 
> I'll test a patch for 2.6.5 if you provide one.

Here you go...

With this patch
        - the major timeout is of fixed length "timeo<<retrans", and the
        clock starts at the first attempt to send the packet.
        - If a major timeout occurs, we now reset the RTT estimator so
        as to "slow start" when the server becomes available again.

For the moment it does use the timeo + retrans values, because the
former is in fact wanted in order to initialize the RTT estimator.
However, it no longer uses the count of the number of actual
retransmissions in order to determine whether or not a major timeout
occurred.
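
The fixed-length window described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the patch itself: values are in milliseconds rather than jiffies, the function names are made up, and max_ms stands in for the transport's maximum timeout.

```c
/* Length of the major-timeout window: timeo doubled `retrans` times,
 * clamped to max_ms. Names are illustrative, not kernel identifiers.
 * Assumes max_ms <= ULONG_MAX / 2 so the doubling cannot wrap. */
static unsigned long major_timeout_len(unsigned long timeo_ms,
				       unsigned int retrans,
				       unsigned long max_ms)
{
	unsigned long len = timeo_ms;
	unsigned int i;

	/* Stop doubling as soon as the cap has been reached. */
	for (i = 0; i < retrans && len < max_ms; i++)
		len <<= 1;
	if (len > max_ms || len == 0)
		len = max_ms;
	return len;
}

/* Nonzero once the window that started at first_send_ms has expired,
 * regardless of how many retransmissions were actually sent. */
static int major_timeout_expired(unsigned long now_ms,
				 unsigned long first_send_ms,
				 unsigned long len_ms)
{
	return now_ms - first_send_ms >= len_ms;
}
```

For timeo=700 ms and retrans=3 the window is 5600 ms; with retrans=5 it is 22400 ms. The decision depends only on elapsed time since the first send, not on a retransmission count.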

Cheers,
  Trond



[-- Attachment #2: linux-2.6.6-01-soft.dif --]
[-- Type: text/plain, Size: 9304 bytes --]

 include/linux/sunrpc/xprt.h    |   10 ++--
 net/sunrpc/auth_gss/auth_gss.c |    2 
 net/sunrpc/clnt.c              |    4 -
 net/sunrpc/timer.c             |    1 
 net/sunrpc/xprt.c              |   91 +++++++++++++++++++++++++----------------
 5 files changed, 63 insertions(+), 45 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/include/linux/sunrpc/xprt.h linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h
--- linux-2.6.6-pre1/include/linux/sunrpc/xprt.h	2004-04-17 11:05:10.000000000 -0700
+++ linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h	2004-04-17 13:55:40.000000000 -0700
@@ -69,8 +69,7 @@ extern unsigned int xprt_tcp_slot_table_
  * This describes a timeout strategy
  */
 struct rpc_timeout {
-	unsigned long		to_current,		/* current timeout */
-				to_initval,		/* initial timeout */
+	unsigned long		to_initval,		/* initial timeout */
 				to_maxval,		/* max timeout */
 				to_increment;		/* if !exponential */
 	unsigned int		to_retries;		/* max # of retries */
@@ -85,7 +84,6 @@ struct rpc_rqst {
 	 * This is the user-visible part
 	 */
 	struct rpc_xprt *	rq_xprt;		/* RPC client */
-	struct rpc_timeout	rq_timeout;		/* timeout parms */
 	struct xdr_buf		rq_snd_buf;		/* send buffer */
 	struct xdr_buf		rq_rcv_buf;		/* recv buffer */
 
@@ -103,6 +101,9 @@ struct rpc_rqst {
 	struct xdr_buf		rq_private_buf;		/* The receive buffer
 							 * used in the softirq.
 							 */
+	unsigned long		rq_majortimeo;	/* major timeout alarm */
+	unsigned long		rq_timeout;	/* Current timeout value */
+	unsigned int		rq_retries;	/* # of retries */
 	/*
 	 * For authentication (e.g. auth_des)
 	 */
@@ -115,7 +116,6 @@ struct rpc_rqst {
 	u32			rq_bytes_sent;	/* Bytes we have sent */
 
 	unsigned long		rq_xtime;	/* when transmitted */
-	int			rq_ntimeo;
 	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
@@ -210,7 +210,7 @@ void			xprt_reserve(struct rpc_task *);
 int			xprt_prepare_transmit(struct rpc_task *);
 void			xprt_transmit(struct rpc_task *);
 void			xprt_receive(struct rpc_task *);
-int			xprt_adjust_timeout(struct rpc_timeout *);
+int			xprt_adjust_timeout(struct rpc_rqst *req);
 void			xprt_release(struct rpc_task *);
 void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/auth_gss/auth_gss.c linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c
--- linux-2.6.6-pre1/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 11:04:59.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 14:31:29.000000000 -0700
@@ -736,10 +736,8 @@ static int
 gss_refresh(struct rpc_task *task)
 {
 	struct rpc_clnt *clnt = task->tk_client;
-	struct rpc_xprt *xprt = task->tk_xprt;
 	struct rpc_cred *cred = task->tk_msg.rpc_cred;
 
-	task->tk_timeout = xprt->timeout.to_current;
 	if (!gss_cred_is_uptodate_ctx(cred))
 		return gss_upcall(clnt, task, cred);
 	return 0;
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/clnt.c linux-2.6.6-01-soft/net/sunrpc/clnt.c
--- linux-2.6.6-pre1/net/sunrpc/clnt.c	2004-04-17 11:04:57.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/clnt.c	2004-04-17 15:05:14.000000000 -0700
@@ -788,13 +788,11 @@ static void
 call_timeout(struct rpc_task *task)
 {
 	struct rpc_clnt	*clnt = task->tk_client;
-	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;
 
-	if (xprt_adjust_timeout(to)) {
+	if (xprt_adjust_timeout(task->tk_rqstp) == 0) {
 		dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
 		goto retry;
 	}
-	to->to_retries = clnt->cl_timeout.to_retries;
 
 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (RPC_IS_SOFT(task)) {
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/timer.c linux-2.6.6-01-soft/net/sunrpc/timer.c
--- linux-2.6.6-pre1/net/sunrpc/timer.c	2004-04-17 11:05:23.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/timer.c	2004-04-17 15:02:33.000000000 -0700
@@ -39,6 +39,7 @@ rpc_init_rtt(struct rpc_rtt *rt, unsigne
 	for (i = 0; i < 5; i++) {
 		rt->srtt[i] = init;
 		rt->sdrtt[i] = RPC_RTO_INIT;
+		rt->ntimeouts[i] = 0;
 	}
 }
 
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/xprt.c linux-2.6.6-01-soft/net/sunrpc/xprt.c
--- linux-2.6.6-pre1/net/sunrpc/xprt.c	2004-04-17 11:05:09.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/xprt.c	2004-04-17 15:21:56.000000000 -0700
@@ -352,35 +352,59 @@ xprt_adjust_cwnd(struct rpc_xprt *xprt, 
 }
 
 /*
+ * Reset the major timeout value
+ */
+static void xprt_reset_majortimeo(struct rpc_rqst *req)
+{
+	struct rpc_timeout *to = &req->rq_xprt->timeout;
+
+	req->rq_majortimeo = req->rq_timeout;
+	if (to->to_exponential)
+		req->rq_majortimeo <<= to->to_retries;
+	else
+		req->rq_majortimeo += to->to_increment * to->to_retries;
+	if (req->rq_majortimeo > to->to_maxval || req->rq_majortimeo == 0)
+		req->rq_majortimeo = to->to_maxval;
+	req->rq_majortimeo += jiffies;
+}
+
+/*
  * Adjust timeout values etc for next retransmit
  */
-int
-xprt_adjust_timeout(struct rpc_timeout *to)
+int xprt_adjust_timeout(struct rpc_rqst *req)
 {
-	if (to->to_retries > 0) {
-		if (to->to_exponential)
-			to->to_current <<= 1;
-		else
-			to->to_current += to->to_increment;
-		if (to->to_maxval && to->to_current >= to->to_maxval)
-			to->to_current = to->to_maxval;
+	struct rpc_xprt *xprt = req->rq_xprt;
+	struct rpc_timeout *to = &xprt->timeout;
+	int status = 0;
+
+	if (time_before(jiffies, req->rq_majortimeo)) {
+		if (req->rq_retries < to->to_retries) {
+			if (to->to_exponential)
+				req->rq_timeout <<= 1;
+			else
+				req->rq_timeout += to->to_increment;
+			if (to->to_maxval && req->rq_timeout >= to->to_maxval)
+				req->rq_timeout = to->to_maxval;
+			req->rq_retries++;
+		}
+		pprintk("RPC: %lu retrans\n", jiffies);
 	} else {
-		if (to->to_exponential)
-			to->to_initval <<= 1;
-		else
-			to->to_initval += to->to_increment;
-		if (to->to_maxval && to->to_initval >= to->to_maxval)
-			to->to_initval = to->to_maxval;
-		to->to_current = to->to_initval;
-	}
-
-	if (!to->to_current) {
-		printk(KERN_WARNING "xprt_adjust_timeout: to_current = 0!\n");
-		to->to_current = 5 * HZ;
-	}
-	pprintk("RPC: %lu %s\n", jiffies,
-			to->to_retries? "retrans" : "timeout");
-	return to->to_retries-- > 0;
+		req->rq_timeout = to->to_initval;
+		req->rq_retries = 0;
+		xprt_reset_majortimeo(req);
+		/* Reset the RTT counters == "slow start" */
+		spin_lock_bh(&xprt->sock_lock);
+		rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval);
+		spin_unlock_bh(&xprt->sock_lock);
+		pprintk("RPC: %lu timeout\n", jiffies);
+		status = -ETIMEDOUT;
+	}
+
+	if (req->rq_timeout == 0) {
+		printk(KERN_WARNING "xprt_adjust_timeout: rq_timeout = 0!\n");
+		req->rq_timeout = 5 * HZ;
+	}
+	return status;
 }
 
 /*
@@ -1166,6 +1190,7 @@ xprt_transmit(struct rpc_task *task)
 			/* Add request to the receive list */
 			list_add_tail(&req->rq_list, &xprt->recv);
 			spin_unlock_bh(&xprt->sock_lock);
+			xprt_reset_majortimeo(req);
 		}
 	} else if (!req->rq_bytes_sent)
 		return;
@@ -1221,7 +1246,7 @@ xprt_transmit(struct rpc_task *task)
 			if (!xprt_connected(xprt))
 				task->tk_status = -ENOTCONN;
 			else if (test_bit(SOCK_NOSPACE, &xprt->sock->flags)) {
-				task->tk_timeout = req->rq_timeout.to_current;
+				task->tk_timeout = req->rq_timeout;
 				rpc_sleep_on(&xprt->pending, task, NULL, NULL);
 			}
 			spin_unlock_bh(&xprt->sock_lock);
@@ -1248,13 +1273,11 @@ xprt_transmit(struct rpc_task *task)
 	if (!xprt->nocong) {
 		int timer = task->tk_msg.rpc_proc->p_timer;
 		task->tk_timeout = rpc_calc_rto(clnt->cl_rtt, timer);
-		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer);
-		task->tk_timeout <<= clnt->cl_timeout.to_retries
-			- req->rq_timeout.to_retries;
-		if (task->tk_timeout > req->rq_timeout.to_maxval)
-			task->tk_timeout = req->rq_timeout.to_maxval;
+		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer) + req->rq_retries;
+		if (task->tk_timeout > xprt->timeout.to_maxval || task->tk_timeout == 0)
+			task->tk_timeout = xprt->timeout.to_maxval;
 	} else
-		task->tk_timeout = req->rq_timeout.to_current;
+		task->tk_timeout = req->rq_timeout;
 	/* Don't race with disconnect */
 	if (!xprt_connected(xprt))
 		task->tk_status = -ENOTCONN;
@@ -1324,7 +1347,7 @@ xprt_request_init(struct rpc_task *task,
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
 
-	req->rq_timeout = xprt->timeout;
+	req->rq_timeout = xprt->timeout.to_initval;
 	req->rq_task	= task;
 	req->rq_xprt    = xprt;
 	req->rq_xid     = xprt_alloc_xid(xprt);
@@ -1381,7 +1404,6 @@ xprt_default_timeout(struct rpc_timeout 
 void
 xprt_set_timeout(struct rpc_timeout *to, unsigned int retr, unsigned long incr)
 {
-	to->to_current   = 
 	to->to_initval   = 
 	to->to_increment = incr;
 	to->to_maxval    = incr * retr;
@@ -1446,7 +1468,6 @@ xprt_setup(int proto, struct sockaddr_in
 	/* Set timeout parameters */
 	if (to) {
 		xprt->timeout = *to;
-		xprt->timeout.to_current = to->to_initval;
 	} else
 		xprt_default_timeout(&xprt->timeout, xprt->prot);
 


* Re: NFS and kernel 2.6.x
  2004-04-17 22:32               ` Trond Myklebust
@ 2004-04-18  3:26                 ` Jamie Lokier
  2004-04-18  7:03                   ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-18  3:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

Trond Myklebust wrote:
> With this patch
>         - the major timeout is of fixed length "timeo<<retrans", and the
>         clock starts at the first attempt to send the packet.
>         - If a major timeout occurs, we now reset the RTT estimator so
>         as to "slow start" when the server becomes available again.
> 
> For the moment it does use the timeo + retrans values, because the
> former is in fact wanted in order to initialize the RTT estimator.
> However, it no longer uses the count of the number of actual
> retransmissions in order to determine whether or not a major timeout
> occurred.

Ok, observations:

    - The RTT converges to 0.1s on my LAN, just as it did before the patch.
      Very sensible, and as you said the 100 microsecond problem is not
      with us these days.

    - The RTT is reset after a timeout (from 0.1-0.15s to 0.7s in my tests).
      As expected.

    - With the defaults (retrans=3, timeo=0.7s), I see:

      After disconnecting the server, the client first times out after
      about 5.5-6 seconds.  First minor timeout is 0.1.  This makes sense
      as 0.7 << 3 == 5.6.

      Subsequent timeouts take about 10.5 seconds.  This also makes sense,
      as you have set the timeout threshold at 0.7*8 == 5.6 seconds,
      and three timeouts is 0.7*(1+2+4) == 4.9 seconds, too short.
      Four timeouts is 0.7*(1+2+4+8) == 10.5 seconds.

      The old behaviour before RTT estimation would have timed out
      after 10.5 seconds, I think.

    - With retrans=5, and timeo still has the default value of 0.7s:

      After disconnecting the server, the minor timeout intervals are
      approximately:

          0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2 seconds.

      Are they intended to stop doubling at 3.2?  The major timeout
      thus happens after 22.3 seconds.

      Unsurprisingly, subsequent major timeouts take 44.1 seconds.
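
The 10.5 and 44.1 second figures for subsequent major timeouts can be checked with a small sketch. It assumes the retransmit interval restarts at timeo after the RTT reset, doubles after every minor timeout, and that the request fails at the first timeout landing at or after the threshold timeo << retrans. It does not model the first major timeout, where the interval starts from the adapted RTT. elapsed_at_major_ms is a hypothetical helper, with all values in milliseconds.

```c
/* Elapsed time (ms) at which a major timeout is declared, given a
 * retransmit interval that starts at start_ms and doubles each time.
 * Illustrative only; assumes the values stay far below ULONG_MAX. */
static unsigned long elapsed_at_major_ms(unsigned long start_ms,
					 unsigned long threshold_ms)
{
	unsigned long elapsed = 0, interval = start_ms;

	if (start_ms == 0)
		return 0;	/* avoid looping forever on a zero interval */
	for (;;) {
		elapsed += interval;
		if (elapsed >= threshold_ms)
			return elapsed;
		interval <<= 1;
	}
}
```

With start_ms=700 and a threshold of 5600 (retrans=3) this returns 10500; with a threshold of 22400 (retrans=5) it returns 44100.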

So this patch is a big improvement, and I'm going to keep using it for my home
directory with retrans=5,soft so it gets some more background testing.
(retrans=3 is too short even with the patch).

However, there are potential improvements.  One is that the 3.2 above
should continue doubling.  The other is that behaviour would be nicer
if the major timeout time was more predictable: 22.3 to 44.1 seconds
is a big range.  This is easy with the algorithm described below.

It isn't possible to remove the variation completely.  However,
it can easily be reduced by changing the doubling strategy: keep
doubling the retransmit time, until it exceeds timeo.  When that
happens, set the retransmit time to the next greater or equal value of
timeo << N for some integer N.

For example, with RTT at 0.1s, retrans=5, timeo=0.7, these would be
the minor timeout intervals:

    0.1, 0.2, 0.4, 0.7, 1.4, 2.8, 5.6, 11.2, 22.4

leading to a total major timeout time of 44.8 seconds.

Subsequent major timeouts, with the RTT reset to 0.7s, would take 44.1
seconds: 0.7, 1.4, 2.8, 5.6, 11.2, 22.4.

If the RTT estimator is larger than timeo to start with, the first
retransmit will time out after RTT, but subsequent ones will be a value
of timeo << N.  E.g. if RTT was 2s, this would be the minor timeout
sequence: 2.0, 2.8, 5.6, 11.2, 22.4.

The algorithm for deciding when a major timeout occurs is different
too.  Instead of keeping track of the total time since the very first
transmission, you simply deem the major timeout to occur after the
minor timeout of timeo << retrans occurs.  I.e. in these examples, the
22.4s minor timeout is always the final one.
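
The stepping rule and the major-timeout rule can be sketched as below. This is one reading of the description, chosen because it reproduces the example sequences given here: double while the result stays below timeo, otherwise move to the smallest timeo << N strictly greater than the current interval. The function names are hypothetical and all values are in milliseconds.

```c
/* Next retransmit interval: double while the result stays below timeo,
 * otherwise snap to the smallest timeo << N greater than cur.
 * Illustrative only; assumes values far below ULONG_MAX. */
static unsigned long next_interval_ms(unsigned long cur, unsigned long timeo)
{
	unsigned long v;

	if (timeo == 0)
		return cur;		/* degenerate input: nothing to snap to */
	if (cur != 0 && cur * 2 < timeo)
		return cur * 2;
	for (v = timeo; v <= cur; v <<= 1)
		;
	return v;
}

/* Sum of all minor timeout intervals, where the interval that reaches
 * timeo << retrans is deemed the final one (the major timeout). */
static unsigned long total_until_major_ms(unsigned long start,
					  unsigned long timeo,
					  unsigned int retrans)
{
	unsigned long last = timeo << retrans;
	unsigned long t = start, total = 0;

	for (;;) {
		total += t;
		if (t >= last)
			return total;
		t = next_interval_ms(t, timeo);
	}
}
```

With timeo=700 and retrans=5, a start of 100 gives 44800, a start of 700 gives 44100, and a start of 2000 gives 44000, matching the sequences listed above.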

This reduces the possible variation, with these parameters, to the
range 44.1 to 45.325 seconds: much more consistent than 22.05 to 44.1
seconds.

As well as giving more consistent results, this might even be simpler
than the algorithm in your patch, because there is no need to remember
the total time since the first transmission.

-- Jamie


* Re: NFS and kernel 2.6.x
  2004-04-18  3:26                 ` Jamie Lokier
@ 2004-04-18  7:03                   ` Trond Myklebust
  2004-04-18 23:22                     ` Jamie Lokier
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-18  7:03 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:

>       Are they intended to stop doubling at 3.2?  The major timeout
>       thus happens after 22.3 seconds.
> 
>       Unsurprisingly, subsequent major timeouts take 44.1 seconds.

Right... ...but since the timeout value is already capped at 60 seconds,
this is not a major problem. It is pretty pointless to be talking about
"predictable" or "consistent" behaviour when talking about a situation
where we believe that the server has crashed.

AFAICS, all we care about is to establish a predictable *lower limit*.

Cheers,
  Trond


* Re: NFS and kernel 2.6.x
  2004-04-18  7:03                   ` Trond Myklebust
@ 2004-04-18 23:22                     ` Jamie Lokier
  2004-04-19 15:38                       ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-18 23:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jamie Lokier, linux-kernel

Trond Myklebust wrote:
> On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:
> >       Are they intended to stop doubling at 3.2?  The major timeout
> >       thus happens after 22.3 seconds.
> > 
> >       Unsurprisingly, subsequent major timeouts take 44.1 seconds.
> 
> Right... ...but since the timeout value is already capped at 60 seconds,
> this is not a major problem. It is pretty pointless to be talking about
> "predictable" or "consistent" behaviour when talking about a situation
> where we believe that the server has crashed.

I agree, but would still prefer more consistent behaviour if it is
easy -- and I explained how to do it, it's an easy algorithm.

You don't respond to the other question: the doubling stopping at
3.2s.  Is it intended?  It goes against a basic principle of congestion
control.

> AFAICS, all we care about is to establish a predictable *lower limit*.

I agree that is the most important thing, and the old behaviour was
probably the cause of problems for at least one poster on this thread.

-- Jamie



* Re: NFS and kernel 2.6.x
  2004-04-18 23:22                     ` Jamie Lokier
@ 2004-04-19 15:38                       ` Trond Myklebust
  2004-04-19 16:19                         ` Trond Myklebust
  2004-04-20  0:09                         ` Jamie Lokier
  0 siblings, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 15:38 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Sun, 2004-04-18 at 19:22, Jamie Lokier wrote:

> I agree, but would still prefer more consistent behaviour if it is
> easy -- and I explained how to do it, it's an easy algorithm.

The reason I don't like it is that it continues to tie the major timeout
to the resend timeouts. You've convinced me that they should not be the
same thing.

The other reason is that it only improves matters for the first request.
Once we reset the RTO, all those other outstanding requests are anyway
going to see an immediate discontinuity as their basic timeout jumps
from 1ms to 700ms. So why go to all that trouble just for 1 request?

> You don't respond to the other question: the doubling stopping at
> 3.2s.  Is it intended?  It goes against a basic principle of congestion
> control.

I can put it back in.

It was partly another "consistency" issue that initially worried me,
partly in order to avoid problems with overflow:
If you have more than one outstanding request, then those that get
scheduled after the first major timeout (when we reset the RTO
estimator) will see a "jump". If the "retries" variable is too large,
they will either jump straight over 60 seconds, and thus trigger the cap
or they will end up at zero due to 32-bit overflow.

I agree, though, that this is less of an issue.
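
The overflow case can be shown with a short user-space sketch. It uses a 32-bit value in milliseconds and the same style of guard as the patch applies to tk_timeout (a result above the cap, or equal to zero, is replaced by the cap). shifted_timeout is a hypothetical name, and the 60000 ms cap in the note below is an assumed figure.

```c
#include <stdint.h>

/* Shift a 32-bit timeout left, then clamp: a result above maxval, or one
 * that wrapped all the way to zero, is replaced by maxval.
 * Illustration only; shift must be less than 32. */
static uint32_t shifted_timeout(uint32_t rto, unsigned int shift, uint32_t maxval)
{
	uint32_t t = (uint32_t)(rto << shift);

	if (t > maxval || t == 0)
		t = maxval;
	return t;
}
```

For example, 700 << 3 stays at 5600, 700 << 7 exceeds a 60000 cap and is clamped, and 700 << 30 wraps to zero in 32 bits and is clamped by the zero check.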

Cheers,
  Trond


* Re: NFS and kernel 2.6.x
  2004-04-19 15:38                       ` Trond Myklebust
@ 2004-04-19 16:19                         ` Trond Myklebust
  2004-04-20  0:09                         ` Jamie Lokier
  1 sibling, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 16:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]

On Mon, 2004-04-19 at 11:38, Trond Myklebust wrote:
> On Sun, 2004-04-18 at 19:22, Jamie Lokier wrote:

> > You don't respond to the other question: the doubling stopping at
> > 3.2s.  Is it intended?  It goes against a basic principle of congestion
> > control.
> 
> I can put it back in.

Here's a patch that continues doubling.

Cheers,
  Trond

[-- Attachment #2: Type: text/plain, Size: 9195 bytes --]

 include/linux/sunrpc/xprt.h    |   10 ++---
 net/sunrpc/auth_gss/auth_gss.c |    2 -
 net/sunrpc/clnt.c              |    4 --
 net/sunrpc/timer.c             |    1 
 net/sunrpc/xprt.c              |   81 +++++++++++++++++++++++++----------------
 5 files changed, 57 insertions(+), 41 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/include/linux/sunrpc/xprt.h linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h
--- linux-2.6.6-rc1/include/linux/sunrpc/xprt.h	2004-04-17 23:01:09.000000000 -0400
+++ linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h	2004-04-19 11:57:32.000000000 -0400
@@ -69,8 +69,7 @@ extern unsigned int xprt_tcp_slot_table_
  * This describes a timeout strategy
  */
 struct rpc_timeout {
-	unsigned long		to_current,		/* current timeout */
-				to_initval,		/* initial timeout */
+	unsigned long		to_initval,		/* initial timeout */
 				to_maxval,		/* max timeout */
 				to_increment;		/* if !exponential */
 	unsigned int		to_retries;		/* max # of retries */
@@ -85,7 +84,6 @@ struct rpc_rqst {
 	 * This is the user-visible part
 	 */
 	struct rpc_xprt *	rq_xprt;		/* RPC client */
-	struct rpc_timeout	rq_timeout;		/* timeout parms */
 	struct xdr_buf		rq_snd_buf;		/* send buffer */
 	struct xdr_buf		rq_rcv_buf;		/* recv buffer */
 
@@ -103,6 +101,9 @@ struct rpc_rqst {
 	struct xdr_buf		rq_private_buf;		/* The receive buffer
 							 * used in the softirq.
 							 */
+	unsigned long		rq_majortimeo;	/* major timeout alarm */
+	unsigned long		rq_timeout;	/* Current timeout value */
+	unsigned int		rq_retries;	/* # of retries */
 	/*
 	 * For authentication (e.g. auth_des)
 	 */
@@ -115,7 +116,6 @@ struct rpc_rqst {
 	u32			rq_bytes_sent;	/* Bytes we have sent */
 
 	unsigned long		rq_xtime;	/* when transmitted */
-	int			rq_ntimeo;
 	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
@@ -210,7 +210,7 @@ void			xprt_reserve(struct rpc_task *);
 int			xprt_prepare_transmit(struct rpc_task *);
 void			xprt_transmit(struct rpc_task *);
 void			xprt_receive(struct rpc_task *);
-int			xprt_adjust_timeout(struct rpc_timeout *);
+int			xprt_adjust_timeout(struct rpc_rqst *req);
 void			xprt_release(struct rpc_task *);
 void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c
--- linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 23:00:57.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c	2004-04-19 11:57:32.000000000 -0400
@@ -736,10 +736,8 @@ static int
 gss_refresh(struct rpc_task *task)
 {
 	struct rpc_clnt *clnt = task->tk_client;
-	struct rpc_xprt *xprt = task->tk_xprt;
 	struct rpc_cred *cred = task->tk_msg.rpc_cred;
 
-	task->tk_timeout = xprt->timeout.to_current;
 	if (!gss_cred_is_uptodate_ctx(cred))
 		return gss_upcall(clnt, task, cred);
 	return 0;
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/clnt.c linux-2.6.6-01-soft/net/sunrpc/clnt.c
--- linux-2.6.6-rc1/net/sunrpc/clnt.c	2004-04-17 23:00:47.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/clnt.c	2004-04-19 11:57:32.000000000 -0400
@@ -788,13 +788,11 @@ static void
 call_timeout(struct rpc_task *task)
 {
 	struct rpc_clnt	*clnt = task->tk_client;
-	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;
 
-	if (xprt_adjust_timeout(to)) {
+	if (xprt_adjust_timeout(task->tk_rqstp) == 0) {
 		dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
 		goto retry;
 	}
-	to->to_retries = clnt->cl_timeout.to_retries;
 
 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (RPC_IS_SOFT(task)) {
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/timer.c linux-2.6.6-01-soft/net/sunrpc/timer.c
--- linux-2.6.6-rc1/net/sunrpc/timer.c	2004-04-17 23:01:20.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/timer.c	2004-04-19 11:57:32.000000000 -0400
@@ -39,6 +39,7 @@ rpc_init_rtt(struct rpc_rtt *rt, unsigne
 	for (i = 0; i < 5; i++) {
 		rt->srtt[i] = init;
 		rt->sdrtt[i] = RPC_RTO_INIT;
+		rt->ntimeouts[i] = 0;
 	}
 }
 
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/xprt.c linux-2.6.6-01-soft/net/sunrpc/xprt.c
--- linux-2.6.6-rc1/net/sunrpc/xprt.c	2004-04-17 23:01:07.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/xprt.c	2004-04-19 11:58:03.000000000 -0400
@@ -352,35 +352,57 @@ xprt_adjust_cwnd(struct rpc_xprt *xprt, 
 }
 
 /*
+ * Reset the major timeout value
+ */
+static void xprt_reset_majortimeo(struct rpc_rqst *req)
+{
+	struct rpc_timeout *to = &req->rq_xprt->timeout;
+
+	req->rq_majortimeo = req->rq_timeout;
+	if (to->to_exponential)
+		req->rq_majortimeo <<= to->to_retries;
+	else
+		req->rq_majortimeo += to->to_increment * to->to_retries;
+	if (req->rq_majortimeo > to->to_maxval || req->rq_majortimeo == 0)
+		req->rq_majortimeo = to->to_maxval;
+	req->rq_majortimeo += jiffies;
+}
+
+/*
  * Adjust timeout values etc for next retransmit
  */
-int
-xprt_adjust_timeout(struct rpc_timeout *to)
+int xprt_adjust_timeout(struct rpc_rqst *req)
 {
-	if (to->to_retries > 0) {
+	struct rpc_xprt *xprt = req->rq_xprt;
+	struct rpc_timeout *to = &xprt->timeout;
+	int status = 0;
+
+	if (time_before(jiffies, req->rq_majortimeo)) {
 		if (to->to_exponential)
-			to->to_current <<= 1;
+			req->rq_timeout <<= 1;
 		else
-			to->to_current += to->to_increment;
-		if (to->to_maxval && to->to_current >= to->to_maxval)
-			to->to_current = to->to_maxval;
+			req->rq_timeout += to->to_increment;
+		if (to->to_maxval && req->rq_timeout >= to->to_maxval)
+			req->rq_timeout = to->to_maxval;
+		req->rq_retries++;
+		pprintk("RPC: %lu retrans\n", jiffies);
 	} else {
-		if (to->to_exponential)
-			to->to_initval <<= 1;
-		else
-			to->to_initval += to->to_increment;
-		if (to->to_maxval && to->to_initval >= to->to_maxval)
-			to->to_initval = to->to_maxval;
-		to->to_current = to->to_initval;
+		req->rq_timeout = to->to_initval;
+		req->rq_retries = 0;
+		xprt_reset_majortimeo(req);
+		/* Reset the RTT counters == "slow start" */
+		spin_lock_bh(&xprt->sock_lock);
+		rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval);
+		spin_unlock_bh(&xprt->sock_lock);
+		pprintk("RPC: %lu timeout\n", jiffies);
+		status = -ETIMEDOUT;
 	}
 
-	if (!to->to_current) {
-		printk(KERN_WARNING "xprt_adjust_timeout: to_current = 0!\n");
-		to->to_current = 5 * HZ;
-	}
-	pprintk("RPC: %lu %s\n", jiffies,
-			to->to_retries? "retrans" : "timeout");
-	return to->to_retries-- > 0;
+	if (req->rq_timeout == 0) {
+		printk(KERN_WARNING "xprt_adjust_timeout: rq_timeout = 0!\n");
+		req->rq_timeout = 5 * HZ;
+	}
+	return status;
 }
 
 /*
@@ -1166,6 +1188,7 @@ xprt_transmit(struct rpc_task *task)
 			/* Add request to the receive list */
 			list_add_tail(&req->rq_list, &xprt->recv);
 			spin_unlock_bh(&xprt->sock_lock);
+			xprt_reset_majortimeo(req);
 		}
 	} else if (!req->rq_bytes_sent)
 		return;
@@ -1221,7 +1244,7 @@ xprt_transmit(struct rpc_task *task)
 			if (!xprt_connected(xprt))
 				task->tk_status = -ENOTCONN;
 			else if (test_bit(SOCK_NOSPACE, &xprt->sock->flags)) {
-				task->tk_timeout = req->rq_timeout.to_current;
+				task->tk_timeout = req->rq_timeout;
 				rpc_sleep_on(&xprt->pending, task, NULL, NULL);
 			}
 			spin_unlock_bh(&xprt->sock_lock);
@@ -1248,13 +1271,11 @@ xprt_transmit(struct rpc_task *task)
 	if (!xprt->nocong) {
 		int timer = task->tk_msg.rpc_proc->p_timer;
 		task->tk_timeout = rpc_calc_rto(clnt->cl_rtt, timer);
-		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer);
-		task->tk_timeout <<= clnt->cl_timeout.to_retries
-			- req->rq_timeout.to_retries;
-		if (task->tk_timeout > req->rq_timeout.to_maxval)
-			task->tk_timeout = req->rq_timeout.to_maxval;
+		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer) + req->rq_retries;
+		if (task->tk_timeout > xprt->timeout.to_maxval || task->tk_timeout == 0)
+			task->tk_timeout = xprt->timeout.to_maxval;
 	} else
-		task->tk_timeout = req->rq_timeout.to_current;
+		task->tk_timeout = req->rq_timeout;
 	/* Don't race with disconnect */
 	if (!xprt_connected(xprt))
 		task->tk_status = -ENOTCONN;
@@ -1324,7 +1345,7 @@ xprt_request_init(struct rpc_task *task,
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
 
-	req->rq_timeout = xprt->timeout;
+	req->rq_timeout = xprt->timeout.to_initval;
 	req->rq_task	= task;
 	req->rq_xprt    = xprt;
 	req->rq_xid     = xprt_alloc_xid(xprt);
@@ -1381,7 +1402,6 @@ xprt_default_timeout(struct rpc_timeout 
 void
 xprt_set_timeout(struct rpc_timeout *to, unsigned int retr, unsigned long incr)
 {
-	to->to_current   = 
 	to->to_initval   = 
 	to->to_increment = incr;
 	to->to_maxval    = incr * retr;
@@ -1446,7 +1466,6 @@ xprt_setup(int proto, struct sockaddr_in
 	/* Set timeout parameters */
 	if (to) {
 		xprt->timeout = *to;
-		xprt->timeout.to_current = to->to_initval;
 	} else
 		xprt_default_timeout(&xprt->timeout, xprt->prot);
 


* Re: NFS and kernel 2.6.x
  2004-04-19 15:38                       ` Trond Myklebust
  2004-04-19 16:19                         ` Trond Myklebust
@ 2004-04-20  0:09                         ` Jamie Lokier
  1 sibling, 0 replies; 54+ messages in thread
From: Jamie Lokier @ 2004-04-20  0:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

Trond Myklebust wrote:
> > I agree, but would still prefer more consistent behaviour if it is
> > easy -- and I explained how to do it, it's an easy algorithm.
> 
> The reason I don't like it is that it continues to tie the major timeout
> to the resend timeouts. You've convinced me that they should not be the
> same thing.

Sorry, I don't understand that paragraph.

The algorithm I suggested _decouples_ the major timeout from the rtt
estimate.  Your algorithm strongly couples them.  I'm not sure what
you mean by saying the major timeout is "tied to the resend timeouts".

Your current (patched) algorithm sets the major timeout to be in the
range:

     [timeo << retrans, (timeo << retrans) * 2]

The suggested algorithm sets the major timeout to be in the range:

     [timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo)

I.e. with retrans set to a new default of 5 (I think that's useful),
the major timeout is approx [44.8, 46] instead of [22.4, 44.8].
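
The two ranges can be computed directly from the expressions above. A minimal sketch, with times in milliseconds; struct ms_range and both function names are hypothetical.

```c
struct ms_range {
	unsigned long lo;	/* lower bound, ms */
	unsigned long hi;	/* upper bound, ms */
};

/* [timeo << retrans, (timeo << retrans) * 2] */
static struct ms_range patched_range(unsigned long timeo, unsigned int retrans)
{
	struct ms_range r = { timeo << retrans, (timeo << retrans) * 2 };
	return r;
}

/* [timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo) */
static struct ms_range suggested_range(unsigned long timeo, unsigned int retrans)
{
	struct ms_range r = { timeo << (retrans + 1),
			      (timeo << (retrans + 1)) + 2 * timeo };
	return r;
}
```

With timeo=700 and retrans=5, patched_range gives 22400 to 44800 and suggested_range gives 44800 to 46200.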

I agree it's not the most important thing in the world, but it is nice
to be able to fix the parameters and say that with the defaults, major
timeout happens after about 45 seconds.

You say you don't like it because major timeout is still tied to
something.  Could you explain what the ideal behaviour you have in
mind is?  Right now, with the patch, I think your intention is to have
a fixed major timeout time, but it doesn't work like that.

> The other reason is that it only improves matters for the first request.
> Once we reset the RTO, all those other outstanding requests are anyway
> going to see an immediate discontinuity as their basic timeout jumps
> from 1ms to 700ms.

Yes, that's the point: after the retransmits pass a threshold, we
should no longer depend on the RTO estimate because it doesn't seem to
be reliable.

> So why go to all that trouble just for 1 request?

Because it's visible behaviour with "soft" mounts.  Someone unplugs
the cable or the network is down, and you see the I/O errors after
about 40 seconds.  This is nicer than seeing them after an unknown
period between 40 and 80 (or 20 and 40 depending on your settings).

> It was partly another "consistency" issue that initially worried me,
> partly in order to avoid problems with overflow:
> If you have more than one outstanding request, then those that get
> scheduled after the first major timeout (when we reset the RTO
> estimator) will see a "jump". If the "retries" variable is too large,
> they will either jump straight over 60 seconds, and thus trigger the cap
> or they will end up at zero due to 32-bit overflow.

Ah.  So you keep track of the number of retries per request, and each
time you send a request you set its timeout to (RTO << retries)?

If you do, maybe that's why my algorithm seems overcomplicated, and
you're concerned about overflows etc.

Instead of counting retries, don't.  You don't need a per-request
retries counter.  Instead: keep track of the request_timeout when the
request was last issued.  When retransmitting, compare that value
against the global value (timeo << retrans).  When a request times out
and request_timeout >= (timeo << retrans), that's a major timeout.
Otherwise you just check if request_timeout < timeo.  If yes, double
it.  If no, set request_timeout = timeo << N for the smallest integer
N such that it's an increase.  And try again.

Notice how that logic is based on constants: it's independent of RTO,
and so outstanding requests aren't affected by changes in RTO.
There's no jump, no overflow, and you can compute the key constant
(timeo << retrans) when initialising: retrans isn't used by itself.
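
A minimal sketch of the per-request logic described above, under a few
assumptions: times are in milliseconds, timeo is 700, retrans is 5, and
a request starts from a small RTO-derived timeout. The names
next_timeout and retransmit_schedule are hypothetical and do not come
from any patch in this thread; the max(1, ...) guard against a zero
timeout is also an addition.

```python
def next_timeout(request_timeout, timeo, retrans):
    """Return the timeout for the next retransmit, or None on major timeout."""
    major = timeo << retrans  # constant; could be computed once at init
    if request_timeout >= major:
        return None  # major timeout
    if request_timeout < timeo:
        # Still below timeo: plain doubling (guard against 0 is an assumption).
        return max(1, request_timeout * 2)
    # At or above timeo: move to timeo << N for the smallest N that increases it.
    n = 0
    while (timeo << n) <= request_timeout:
        n += 1
    return timeo << n


def retransmit_schedule(initial_timeout, timeo, retrans):
    """List every timeout a request waits through before the major timeout."""
    schedule = []
    t = initial_timeout
    while t is not None:
        schedule.append(t)
        t = next_timeout(t, timeo, retrans)
    return schedule


schedule = retransmit_schedule(initial_timeout=1, timeo=700, retrans=5)
print(schedule, "total ms:", sum(schedule))
```

Nothing in next_timeout reads the current RTO estimate, so resetting the
estimator does not change the schedule of a request already in flight.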

-- Jamie

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
       [not found]       ` <20040417000353.GA3750@widomaker.com>
@ 2004-04-17  5:28         ` Trond Myklebust
  2004-04-17 17:55           ` Charles Shannon Hendrix
  0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17  5:28 UTC (permalink / raw)
  To: Charles Shannon Hendrix; +Cc: linux-kernel

On Fri, 2004-04-16 at 17:03, Charles Shannon Hendrix wrote:
> > 
> > 2.6.x can cache a lot more data, and will tend to write it out in a more
> > lazy fashion (i.e. only when the user requests it). That means the
> > writes tend to occur in a more bursty fashion.
> 
> That makes sense.
> 
> Was there a specific reason for making NFS traffic bursty, or did it
> just work out that way?

It's an inevitable side-effect of the increased caching. If you are
constantly writing out data, then you spread out the load a lot more
than if you wait until the user actually requests a flush.
On the other hand, it means that if your application reads/writes several
times over the same page, then you only write it out once.
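
A toy illustration of that trade-off, not the kernel's page cache: the
class ToyPageCache and its methods are invented for this sketch. Writes
only mark a page dirty, and nothing goes out until a flush is requested,
so repeated writes to one page cost a single transfer but everything
leaves in one burst.

```python
class ToyPageCache:
    """Hypothetical write-back cache keyed by page number."""

    def __init__(self):
        self.dirty = {}       # page number -> latest data
        self.pages_sent = 0   # total pages written out so far

    def write(self, page_no, data):
        # Overwrites any earlier dirty copy of the same page.
        self.dirty[page_no] = data

    def flush(self):
        # Everything dirty goes out at once: one burst.
        sent = len(self.dirty)
        self.pages_sent += sent
        self.dirty.clear()
        return sent


cache = ToyPageCache()
for i in range(3):
    cache.write(0, b"version %d" % i)  # same page rewritten three times
cache.write(1, b"other page")
burst = cache.flush()
print("pages written in the burst:", burst)
```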

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17  5:28         ` Trond Myklebust
@ 2004-04-17 17:55           ` Charles Shannon Hendrix
  2004-04-17 18:55             ` Trond Myklebust
  0 siblings, 1 reply; 54+ messages in thread
From: Charles Shannon Hendrix @ 2004-04-17 17:55 UTC (permalink / raw)
  To: Linux Kernel

Fri, 16 Apr 2004 @ 22:28 -0700, Trond Myklebust said:

> It's an inevitable side-effect of the increased caching. 

OK.  That answers my question of: was making NFS bursty done on purpose.
Answer: no.

> If you are constantly writing out data, then you spread out the load
> a lot more than if you wait until the user actually requests a flush.
> On the other hand, it means that if your application reads/writes
> several times over the same page, then you only write it out once.

Usually, eliminating redundant writes in your application is a better
optimization than relying on the OS to do it for you.

I find bursty I/O is less desirable in most cases.


-- 
shannon "AT" widomaker.com -- ["The trade of governing has always been
monopolized by the most ignorant and the most rascally individuals of
mankind.  -- Thomas Paine"]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
  2004-04-17 17:55           ` Charles Shannon Hendrix
@ 2004-04-17 18:55             ` Trond Myklebust
  0 siblings, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 18:55 UTC (permalink / raw)
  To: Charles Shannon Hendrix; +Cc: Linux Kernel

On Sat, 2004-04-17 at 10:55, Charles Shannon Hendrix wrote:
> Usually, eliminating redundant writes in your application is a better
> optimization than relying on the OS to do it for you.

Fine. As long as you can convince all the other people sharing the same
page cache to do so too. We're not talking about single applications
here...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: NFS and kernel 2.6.x
       [not found]           ` <1LDBL-uY-3@gated-at.bofh.it>
@ 2004-04-16 20:31             ` Andi Kleen
  0 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2004-04-16 20:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> writes:
>
> Maybe TCP should be the default then? In case no one finds the reason 
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in 
> theory?

Problem is that older Linux knfsd (early 2.4) tends to crash or hang
after some time when they have to talk TCP. But I guess it would
be still a better default ... 

-Andi


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2004-04-20  0:09 UTC | newest]

Thread overview: 54+ messages
2004-04-16  1:14 NFS and kernel 2.6.x Charles Shannon Hendrix
2004-04-16  1:31 ` Trond Myklebust
2004-04-16  1:53   ` Andrew Morton
2004-04-16  2:54     ` Trond Myklebust
2004-04-16  4:59       ` Phil Oester
2004-04-16  5:29         ` Trond Myklebust
2004-04-16  7:13           ` Paul Wagland
2004-04-16 14:44           ` Marcelo Tosatti
2004-04-16 14:46             ` Marcelo Tosatti
2004-04-16 15:50             ` Trond Myklebust
2004-04-16 15:55             ` Dave Gilbert (Home)
2004-04-16 16:13               ` Trond Myklebust
2004-04-16 19:07                 ` Daniel Egger
2004-04-17  4:56                   ` Chris Friesen
2004-04-17  9:56                     ` Daniel Egger
2004-04-17  5:24                   ` Trond Myklebust
2004-04-17 14:15                     ` Daniel Egger
2004-04-16 19:11                 ` Charles Shannon Hendrix
2004-04-17 16:44           ` Matthias Urlichs
2004-04-17 18:15             ` Trond Myklebust
2004-04-17 18:32               ` Marc Singer
2004-04-17 18:58                 ` Trond Myklebust
2004-04-17 19:01                   ` Marc Singer
2004-04-17 19:09                     ` Trond Myklebust
2004-04-17 19:19                       ` Russell King
2004-04-18  2:51                         ` Trond Myklebust
2004-04-19 16:39                           ` Trond Myklebust
2004-04-19 21:10                             ` Trond Myklebust
2004-04-17 22:22                   ` Marc Singer
2004-04-18  0:57                     ` Trond Myklebust
2004-04-18  5:01                       ` Marc Singer
2004-04-18  6:36                         ` Chris Friesen
2004-04-18  7:56                           ` Russell King
2004-04-18 17:31                             ` Marc Singer
2004-04-17 19:01                 ` Daniel Egger
2004-04-17 20:22                   ` Marc Singer
2004-04-18 11:14                     ` Daniel Egger
2004-04-19  9:06             ` Helge Hafting
2004-04-16  9:03     ` Jamie Lokier
2004-04-16 15:55       ` Trond Myklebust
2004-04-16 18:48         ` Jamie Lokier
2004-04-16 19:06           ` Trond Myklebust
2004-04-16 19:39             ` Jamie Lokier
2004-04-17 22:32               ` Trond Myklebust
2004-04-18  3:26                 ` Jamie Lokier
2004-04-18  7:03                   ` Trond Myklebust
2004-04-18 23:22                     ` Jamie Lokier
2004-04-19 15:38                       ` Trond Myklebust
2004-04-19 16:19                         ` Trond Myklebust
2004-04-20  0:09                         ` Jamie Lokier
     [not found]   ` <20040416190126.GB408@widomaker.com>
     [not found]     ` <1082144608.2581.156.camel@lade.trondhjem.org>
     [not found]       ` <20040417000353.GA3750@widomaker.com>
2004-04-17  5:28         ` Trond Myklebust
2004-04-17 17:55           ` Charles Shannon Hendrix
2004-04-17 18:55             ` Trond Myklebust
     [not found] <1Lql8-6O3-1@gated-at.bofh.it>
     [not found] ` <1LquO-6TK-5@gated-at.bofh.it>
     [not found]   ` <1LqOg-76p-19@gated-at.bofh.it>
     [not found]     ` <1LrKo-7Sn-21@gated-at.bofh.it>
     [not found]       ` <1LtM3-12d-5@gated-at.bofh.it>
     [not found]         ` <1Luf2-1kK-1@gated-at.bofh.it>
     [not found]           ` <1LDBL-uY-3@gated-at.bofh.it>
2004-04-16 20:31             ` Andi Kleen
