* NFS and kernel 2.6.x
From: Charles Shannon Hendrix @ 2004-04-16 1:14 UTC
To: Linux Kernel
I'm having a hard time right now with NFS on kernel 2.6.

I tried to search archives but can't find much on my exact problem. If
I missed something good, a pointer would be great.

Anyway, the problem: NFS writes are broken in 2.6 on my machine.

I normally mount several volumes from a Sun SS5 running NetBSD.
It's worked great for years, and usually is not too bad on speed.

When I moved to Linux kernel 2.6.1, writes to the NetBSD server got
incredibly slow. Throughput went from around 600K/sec to somewhere
between a few K/sec and maybe 25K/sec.

By contrast, rsync runs at around 900K/sec or faster, close to wire
speed (yes, raw speed, not compressed speed).

With kernels 2.6.3 and 2.6.5, it doesn't work at all. If I do something
like this:

% cp bigfile /public

It just hangs. After that, umounts or even reads of that volume hang too.
They can be killed, but not always. Gnome's Nautilus, for example, gets
permanently hung, though that might be its own issue.

Offhand, I cannot remember what NFS write performance was with Linux
kernel 2.4, but it was several hundred K/sec unless the server was
loaded.

Reading from the NFS server seems to still be fine. For example, just
now I copied a file from there at around 660K/sec using kernel 2.6.5
on the client.

Anyway, I would like to explore this further and solve the problem.
Details on my setup:

NFS server:

    Sun SS5
    10baseT ethernet (100baseT card available, not used)
    NetBSD 1.6.1
    pretty much a plain vanilla server setup

Network:

    simple LAN with three machines, connected via a full duplex
    multi-speed switch

NFS client:

    vanilla PC
    Intel Pro/100 ethernet
    Slackware 9.1
    Linux kernel 2.6.5, plain with no mods or patches, only enough
    drivers and features enabled to run my workstation
    configuration as close as I could get to my Linux 2.4
    kernel
--
shannon "AT" widomaker.com -- ["All of us get lost in the darkness,
dreamers turn to look at the stars" -- Rush ]
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-16 1:31 UTC
To: Charles Shannon Hendrix; +Cc: Linux Kernel

On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
>
> NFS server:
>
> Sun SS5
> 10baseT ethernet (100baseT card available, not used)
> NetBSD 1.6.1
> pretty much a plain vanilla server setup
>
> Network:
>
> simple LAN with three machines, connected via a full duplex
> multi-speed switch
>
> NFS client:
>
> vanilla PC
> Intel Pro/100 ethernet
> Slackware 9.1
> Linux kernel 2.6.5, plain with no mods or patches, only enough
> drivers and features enabled to run my workstation
> configuration as close as I could get to my Linux 2.4
> kernel

This is pretty much covered in the NFS FAQ entry B10.

You are experiencing the classical effects of using unreliable transport
(i.e. UDP) on a mixed speed network. Writes to the server are getting
lost, because it is on a slow segment that cannot keep up with the
faster 100Mbit clients.

Use the 'proto=tcp' mount option, and all will be well again.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Andrew Morton @ 2004-04-16 1:53 UTC
To: Trond Myklebust; +Cc: shannon, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> >
> > NFS server:
> >
> > Sun SS5
> > 10baseT ethernet (100baseT card available, not used)
> > NetBSD 1.6.1
> > pretty much a plain vanilla server setup
> >
> > Network:
> >
> > simple LAN with three machines, connected via a full duplex
> > multi-speed switch
> >
> > NFS client:
> >
> > vanilla PC
> > Intel Pro/100 ethernet
> > Slackware 9.1
> > Linux kernel 2.6.5, plain with no mods or patches, only enough
> > drivers and features enabled to run my workstation
> > configuration as close as I could get to my Linux 2.4
> > kernel
>
> This is pretty much covered in the NFS FAQ entry B10.
>
> You are experiencing the classical effects of using unreliable transport
> (i.e. UDP) on a mixed speed network. Writes to the server are getting
> lost, because it is on a slow segment that cannot keep up with the
> faster 100Mbit clients.

But Charles was seeing good performance with 2.4-based clients. When he
went to 2.6 everything fell apart.

Do we know why this regression occurred?
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-16 2:54 UTC
To: Andrew Morton; +Cc: shannon, linux-kernel

On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> But Charles was seeing good performance with 2.4-based clients. When he
> went to 2.6 everything fell apart.
>
> Do we know why this regression occurred?

What regression??? You have a statistic of 1 person whose 3 clients
changed from what was an apparently working setup to what has *always*
been the usual scenario for most people that tried to use the same
broken hardware/software combination whether it be in 2.2.x, 2.4.x or
2.6.x.

The whole problem is that UDP provides unreliable transport... It offers
NO guarantees that the packet will arrive at the destination.
If only 1 fragment out of the 22 that it takes to send a single
wsize=32k write request to the Sun server gets lost on the way, the
Sun's networking layer will ignore that entire packet, and so the whole
write has to time out and get resent.

Switches can usually cache a few fragments if the clients on the 100Mbit
network are sending requests at a rate that almost matches the 10Mbit
bandwidth that the Sun server supports, but if the network is swamped so
that the switch runs out of cache, then it will start to drop packets.
This is the whole reason why Sun set TCP to be their default mount
option when they changed their servers to use 32k read/write.

My biggest suspect for why this particular setup changed in 2.6.x would
therefore be the changes to the way in which writes are scheduled on the
wire. We cache them for longer, and so overall the bandwidth usage goes
down, but at the expense of more "burstiness" when the user closes the
file or does some other fsync()-like operation.

So in fact you have 2 possible workarounds:

  - Use the TCP mount option (by far the better option, since TCP *does*
    provide reliable transport).

  - Keep UDP, but use the wsize mount option to explicitly override the
    server's choice of write sizes. That works by reducing the number of
    fragments per write, and so improving performance by reducing the
    amount of data that need to be resent per fragment lost.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Phil Oester @ 2004-04-16 4:59 UTC
To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel

Actually I can concur -- I recently migrated 100+ servers from 2.4.x to 2.6.3,
and simply could not use UDP mounts and achieve acceptable performance.
Further, I wasn't using 32K r/w as you posit, but was using 8K (against
a NetApp FWIW).

If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
perhaps this should be documented -- or the option should be deprecated.

Phil Oester

On Thu, Apr 15, 2004 at 07:54:08PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> > But Charles was seeing good performance with 2.4-based clients. When he
> > went to 2.6 everything fell apart.
> >
> > Do we know why this regression occurred?
>
> What regression??? You have a statistic of 1 person whose 3 clients
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-16 5:29 UTC
To: Phil Oester; +Cc: Andrew Morton, shannon, linux-kernel

On Thu, 15/04/2004 at 21:59, Phil Oester wrote:

> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> perhaps this should be documented -- or the option should be deprecated.

Put simply: I am not interested in wasting _my_ time investigating cases
where UDP is performing badly if TCP is working fine. The variable
reliability issues with UDP are precisely why we worked to get the TCP
stuff working efficiently.

As for blanket statements like the above: I have seen no evidence yet
that they are any more warranted in 2.6.x than they were in 2.4.x. At
least not as long as I continue to see wire speed performance on reads
and writes on UDP on all my own test setups.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Paul Wagland @ 2004-04-16 7:13 UTC
To: Trond Myklebust; +Cc: shannon, Phil Oester, Andrew Morton, linux-kernel

On Apr 16, 2004, at 7:29, Trond Myklebust wrote:

> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
>
>> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts
>> unusable, perhaps this should be documented -- or the option should be
>> deprecated.
>
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Just as an aside, I can confirm this as well... we use UDP mounts, and
get a pretty constant 10MB/s (assuming people aren't running bloody
xscreensavers!*!)

Cheers,
Paul
* Re: NFS and kernel 2.6.x
From: Marcelo Tosatti @ 2004-04-16 14:44 UTC
To: Trond Myklebust; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
>
> > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > perhaps this should be documented -- or the option should be deprecated.
>
> Put simply: I am not interested in wasting _my_ time investigating cases
> where UDP is performing badly if TCP is working fine. The variable
> reliability issues with UDP are precisely why we worked to get the TCP
> stuff working efficiently.
>
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Maybe TCP should be the default then? In case no one finds the reason
why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
theory?
* Re: NFS and kernel 2.6.x
From: Marcelo Tosatti @ 2004-04-16 14:46 UTC
To: Trond Myklebust; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Fri, Apr 16, 2004 at 11:44:33AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> > On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
> >
> > > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > > perhaps this should be documented -- or the option should be deprecated.
> >
> > Put simply: I am not interested in wasting _my_ time investigating cases
> > where UDP is performing badly if TCP is working fine. The variable
> > reliability issues with UDP are precisely why we worked to get the TCP
> > stuff working efficiently.
> >
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x. At
> > least not as long as I continue to see wire speed performance on reads
> > and writes on UDP on all my own test setups.
>
> Maybe TCP should be the default then?

Or just make a big warning in the Kconfig. Distros will set it to
the default...

> In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-16 15:50 UTC
To: Marcelo Tosatti; +Cc: linux-kernel, Andrew Morton, shannon, Phil Oester

On Fri, 2004-04-16 at 07:44, Marcelo Tosatti wrote:

> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

Are you talking about the TCP server configuration option here, or the
TCP mount option? IMO both should be default.

I've got a patch for the "mount" program, which I've been intending to
send on to Andries (I've just been too busy for the past few weeks to
give it a last review).

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Dave Gilbert (Home) @ 2004-04-16 15:55 UTC
To: Marcelo Tosatti; +Cc: Trond Myklebust, linux-kernel, Andrew Morton, shannon, Phil Oester

Marcelo Tosatti wrote:

> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

While it is reasonable to make TCP default it is important that if there
is a real problem with UDP NFS that it is sorted. Some of us have to
work with older machines and kernels on clients that don't support TCP NFS.

Dave
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-16 16:13 UTC
To: Dave Gilbert (Home); +Cc: Marcelo Tosatti, linux-kernel, Andrew Morton, shannon, Phil Oester

On Fri, 2004-04-16 at 08:55, Dave Gilbert (Home) wrote:

> While it is reasonable to make TCP default it is important that if there
> is a real problem with UDP NFS that it is sorted. Some of us have to
> work with older machines and kernels on clients that don't support TCP NFS.

Then "some of you" can send in a proper bugreport in the usual format if
and when that problem actually occurs.

So far I have NOTHING to tell me there is a problem here. Just a load of
people going ballistic over hot air....
* Re: NFS and kernel 2.6.x
From: Daniel Egger @ 2004-04-16 19:07 UTC
To: Trond Myklebust; +Cc: Linux Kernel

On 16.04.2004, at 18:13, Trond Myklebust wrote:

> Then "some of you" can send in a proper bugreport in the usual format
> if and when that problem actually occurs.

> So far I have NOTHING to tell me there is a problem here. Just a load
> of people going ballistic over hot air....

Great you want to help here. So I've a system which is NFS root using a
3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
water somewhere in between 10 seconds and 5 minutes after boot using
NFS over UDP. The last thing I see are 3 or 4 messages of the type:

server 192.168.11.2 not responding, still trying

NFS seems to work better with 2.6.4 which unfortunately has other nasty
bugs for me; currently I'm running 2.4.26 which works fine, over both
UDP and TCP.

Preempt is off as are the NFS features which I do not trust yet (v4 and
direct IO). Attached is the config for your viewing pleasure.

Please tell me how I can help here and I'll certainly do it.

Servus,
Daniel

[-- Attachment #1.2: config.gz --]
[-- Type: application/x-gzip, Size: 8138 bytes --]
* Re: NFS and kernel 2.6.x
From: Chris Friesen @ 2004-04-17 4:56 UTC
To: Daniel Egger; +Cc: Trond Myklebust, Linux Kernel

Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

If this is an issue, it might make sense to have root be a tmpfs filesystem,
and then have specific network mounts.

Note--don't make "/var/log" network mounted, various apps default to
trying to check for files there--if the server goes away, you can't log in/out.

Chris
* Re: NFS and kernel 2.6.x
From: Daniel Egger @ 2004-04-17 9:56 UTC
To: Chris Friesen; +Cc: Linux Kernel, Trond Myklebust

On 17.04.2004, at 06:56, Chris Friesen wrote:

>> Great you want to help here. So I've a system which is NFS root using a
>> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
>> water somewhere in between 10 seconds and 5 minutes after boot using
>> NFS over UDP. The last thing I see are 3 or 4 messages of the type:
>
> If this is an issue, it might make sense to have root be a tmpfs
> filesystem, and then have specific network mounts.

I'm trying to keep this a standard Debian system as much as possible.
Also I've several machines having a large number of shared partitions,
some of them fulfill different purposes, so I would need to customize
several instances which sounds like much work to me; part of it
certainly unnecessary because it works just fine with older kernels... :)

Also there is the issue that the only thing that is sort of guaranteed
to be transported over the network is the kernel itself. Sometimes it
hangs already when or just after loading init. I'm not convinced it
will be always able to transfer the whole ramdisk....

Forgot to mention: I've also seen segfaults and wrong file contents in
random places while init executes the scripts in /etc/rc*.d but those
seem to have gone away after I used a more conservative set of kernel
config options. Now it'll only hang.

> Note--don't make "/var/log" network mounted, various apps default to
> trying to check for files there--if the server goes away, you can't
> log in/out.

There's unfortunately more to this. I also cannot log in if any of the
files (bash, bashrc, profiles, libraries, etc.) needed for login are
on nfs.

The question here is what is more reliable in terms of data transfer
after an Oops: NFS or syslogd (UDP). So far I'm satisfied with NFS
here, so I don't see a good reason to change.

Servus,
Daniel
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-17 5:24 UTC
To: Daniel Egger; +Cc: Linux Kernel

On Fri, 2004-04-16 at 12:07, Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

...and if you use TCP?

> server 192.168.11.2 not responding, still trying

The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
-w dump.out"...

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Daniel Egger @ 2004-04-17 14:15 UTC
To: Trond Myklebust; +Cc: Linux Kernel

On Sat, 2004-04-17 at 07:24, Trond Myklebust wrote:

> > Great you want to help here. So I've a system which is NFS root using a
> > 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> > water somewhere in between 10 seconds and 5 minutes after boot using
> > NFS over UDP. The last thing I see are 3 or 4 messages of the type:

> ...and if you use TCP?

My bad, I got confused; with TCP I get the hangs, with UDP the
data corruption. Unfortunately it doesn't want to hang for me
right now. :( ...

> > server 192.168.11.2 not responding, still trying

> The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
> -w dump.out"...

but I have two different tasty cases of data corruption using NFS
over UDP traced for you which I'll send you in private. The first
one corrupts init so that it segfaults, the second one probably
crashes the rc starter so that I'm left with an unusable getty login
on console.

I'll try to get the TCP problems traced as well but right now I
don't have the time to wait....

--
Servus,
Daniel
* Re: NFS and kernel 2.6.x
From: Charles Shannon Hendrix @ 2004-04-16 19:11 UTC
To: Linux Kernel

Fri, 16 Apr 2004 @ 09:13 -0700, Trond Myklebust said:

> Then "some of you" can send in a proper bugreport in the usual format if
> and when that problem actually occurs.
>
> So far I have NOTHING to tell me there is a problem here. Just a load of
> people going ballistic over hot air....

Several people are reporting a problem and discussing it, but I don't
see any of them going ballistic.

--
shannon "AT" widomaker.com -- ["Secrecy is the beginning of tyranny." --
Unknown]
* Re: NFS and kernel 2.6.x
From: Matthias Urlichs @ 2004-04-17 16:44 UTC
To: linux-kernel

Hi, Trond Myklebust wrote:

> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x.

Oh, I saw the problem too: a slow client couldn't do full-size reads from
a fast server because the buffer on the client's network card was just 8k.

Granted that the client is a slow m68k Mac, but 2.4 was fast enough to
get the first packet entirely off the card before the last one overruns
the buffer -- while 2.6 has a bit more latency, so it can't.

Apparently that bit of increased latency is offset by the fact that the
machine still limps along if I packet-bomb it. Under 2.4 it locked solid,
so overall I think that the 2.6 situation is an improvement.

--
Matthias Urlichs
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-17 18:15 UTC
To: Matthias Urlichs; +Cc: linux-kernel

On Sat, 2004-04-17 at 09:44, Matthias Urlichs wrote:
> Hi, Trond Myklebust wrote:
>
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x.
>
> Oh, I saw the problem too: a slow client couldn't do full-size reads from
> a fast server because the buffer on the client's network card was just 8k.

Right, and this has always been a problem. I had the same issues when
doing 8k reads on one of my 75MHz Pentiums some 10 years ago. The thing
would more or less lock up and just pump out a constant stream of "time
exceeded" ICMP messages.

The NFS/RPC layer knows nothing about the existence of network cards or
their buffer sizes. Only about sockets and how to read from/write to
them.

This sort of issue is precisely why I'd prefer to see people use TCP by
default. UDP with its dependency on fragmentation works fine on fast
setups with homogeneous lossless networks. It sucks as soon as you break
one of those conditions.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Marc Singer @ 2004-04-17 18:32 UTC
To: linux-kernel

On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> This sort of issue is precisely why I'd prefer to see people use TCP by
> default. UDP with its dependency on fragmentation works fine on fast
> setups with homogeneous lossless networks. It sucks as soon as you break
> one of those conditions.

I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
mount. It looks like the support is there. What activates it?
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-17 18:58 UTC
To: Marc Singer; +Cc: linux-kernel

On Sat, 2004-04-17 at 11:32, Marc Singer wrote:
> On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> > This sort of issue is precisely why I'd prefer to see people use TCP by
> > default. UDP with its dependency on fragmentation works fine on fast
> > setups with homogeneous lossless networks. It sucks as soon as you break
> > one of those conditions.
>
> I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> mount. It looks like the support is there. What activates it?

It's all there. Just use the "tcp" mount option.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Marc Singer @ 2004-04-17 19:01 UTC
To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote:
> > I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> > mount. It looks like the support is there. What activates it?
>
> It's all there. Just use the "tcp" mount option.

I think you are talking about the fstab mount option. Is there a
kernel command line option for this? That's what I've been looking
for. I'm not using an initrd.

Cheers.
* Re: NFS and kernel 2.6.x
From: Trond Myklebust @ 2004-04-17 19:09 UTC
To: Marc Singer; +Cc: linux-kernel

On Sat, 2004-04-17 at 12:01, Marc Singer wrote:

> I think you are talking about the fstab mount option. Is there a
> kernel command line option for this? That's what I've been looking
> for. I'm not using an initrd.

No. I'm talking about the built-in parser to enable NFSROOT to pass
mount options. As in:

    nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]

See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
and your root partition will use TCP instead of UDP.

Cheers,
  Trond
* Re: NFS and kernel 2.6.x
From: Russell King @ 2004-04-17 19:19 UTC
To: Trond Myklebust; +Cc: Marc Singer, linux-kernel

On Sat, Apr 17, 2004 at 12:09:24PM -0700, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 12:01, Marc Singer wrote:
>
> > I think you are talking about the fstab mount option. Is there a
> > kernel command line option for this? That's what I've been looking
> > for. I'm not using an initrd.
>
> No. I'm talking about the built-in parser to enable NFSROOT to pass
> mount options. As in:
>
> nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
>
> See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
> and your root partition will use TCP instead of UDP.

Trond,

Can you explain how this works?

static int __init root_nfs_parse(char *name, char *buf)
{
...
	while ((p = strsep (&name, ",")) != NULL) {
		int token;
		if (!*p)
			continue;
		token = match_token(p, tokens, args);

		/* %u tokens only */
		if (match_int(&args[0], &option))
			return 0;

Firstly, as far as I can see, args[] is uninitialised. If match_token
doesn't touch args[] then we pass match_int some uninitialised kernel
memory.

Secondly, we seem to exit if match_int doesn't parse a number. Not
all options in "tokens" have a number associated with them, including
ones like "tcp".

So, given that "tcp" is the only option, I think we'll end up passing
match_int() some uninitialised memory which may cause a kernel oops.
If not, it probably won't be a valid number, so we'll ignore the option.

However, it will appear to work as long as the first option has a number
associated with it (ie, is one of the first 9 options.)

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core
* Re: NFS and kernel 2.6.x 2004-04-17 19:19 ` Russell King @ 2004-04-18 2:51 ` Trond Myklebust 2004-04-19 16:39 ` Trond Myklebust 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-18 2:51 UTC (permalink / raw) To: Russell King; +Cc: Marc Singer, linux-kernel On Sat, 2004-04-17 at 12:19, Russell King wrote: > Firstly, as far as I can see, args[] is uninitialised. If match_token > doesn't touch args[] then we pass match_int some uninitialised kernel > memory. > > Secondly, we seem to exit if match_int doesn't parse a number. Not > all options in "tokens" have a number associated with them, including > ones like "tcp". Agreed. The correct fix should be something like the appended patch. It depends on all tokens that do take an integer argument being listed first in the enum. Comments? Cheers, Trond nfsroot.c | 17 +++++++++++++---- 1 files changed, 13 insertions(+), 4 deletions(-) --- linux-2.6.6-up/fs/nfs/nfsroot.c.orig 2004-04-17 11:05:10.000000000 -0700 +++ linux-2.6.6-up/fs/nfs/nfsroot.c 2004-04-17 18:47:05.000000000 -0700 @@ -117,11 +117,16 @@ static int mount_port __initdata = 0; / ***************************************************************************/ enum { + /* Options that take integer arguments */ Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin, - Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr, + Opt_acregmax, Opt_acdirmin, Opt_acdirmax, + /* Options that take no arguments */ + Opt_soft, Opt_hard, Opt_intr, Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp, - Opt_broken_suid, Opt_err, + Opt_broken_suid, + /* Error token */ + Opt_err }; static match_table_t tokens = { @@ -146,9 +151,13 @@ static match_table_t tokens = { {Opt_noac, "noac"}, {Opt_lock, "lock"}, {Opt_nolock, "nolock"}, + {Opt_v2, "nfsvers=2"}, {Opt_v2, "v2"}, + {Opt_v3, "nfsvers=3"}, {Opt_v3, "v3"}, + {Opt_udp, "proto=udp"}, {Opt_udp, "udp"}, 
+ {Opt_tcp, "proto=tcp"}, {Opt_tcp, "tcp"}, {Opt_broken_suid, "broken_suid"}, {Opt_err, NULL} @@ -179,8 +188,8 @@ static int __init root_nfs_parse(char *n continue; token = match_token(p, tokens, args); - /* %u tokens only */ - if (match_int(&args[0], &option)) + /* %u tokens only. Beware if you add new tokens! */ + if (token < Opt_soft && match_int(&args[0], &option)) return 0; switch (token) { case Opt_port: ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-18 2:51 ` Trond Myklebust @ 2004-04-19 16:39 ` Trond Myklebust 2004-04-19 21:10 ` Trond Myklebust 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-19 16:39 UTC (permalink / raw) To: Russell King; +Cc: Marc Singer, linux-kernel [-- Attachment #1: Type: text/plain, Size: 832 bytes --] On Sat, 2004-04-17 at 22:51, Trond Myklebust wrote: > On Sat, 2004-04-17 at 12:19, Russell King wrote: > > > Firstly, as far as I can see, args[] is uninitialised. If match_token > > doesn't touch args[] then we pass match_int some uninitialised kernel > > memory. > > > > Secondly, we seem to exit if match_int doesn't parse a number. Not > > all options in "tokens" have a number associated with them, including > > ones like "tcp". > > Agreed. The correct fix should be something like the appended patch. It > depends on all tokens that do take an integer argument being listed > first in the enum. > > Comments? It turned out there were a few extra issues that weren't fixed by the previous patch. Thanks to boris@macbeth.rhoen.de for helping debug them. Hopefully this will be the final set of fixes. 
Cheers, Trond [-- Attachment #2: Type: text/plain, Size: 2951 bytes --] nfsroot.c | 33 +++++++++++++++++++++------------ 1 files changed, 21 insertions(+), 12 deletions(-) diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c --- linux-2.6.6-01-soft/fs/nfs/nfsroot.c 2004-04-17 23:01:09.000000000 -0400 +++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c 2004-04-19 12:08:31.000000000 -0400 @@ -117,11 +117,16 @@ static int mount_port __initdata = 0; / ***************************************************************************/ enum { + /* Options that take integer arguments */ Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin, - Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr, + Opt_acregmax, Opt_acdirmin, Opt_acdirmax, + /* Options that take no arguments */ + Opt_soft, Opt_hard, Opt_intr, Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp, - Opt_broken_suid, Opt_err, + Opt_broken_suid, + /* Error token */ + Opt_err }; static match_table_t tokens = { @@ -146,9 +151,13 @@ static match_table_t tokens = { {Opt_noac, "noac"}, {Opt_lock, "lock"}, {Opt_nolock, "nolock"}, + {Opt_v2, "nfsvers=2"}, {Opt_v2, "v2"}, + {Opt_v3, "nfsvers=3"}, {Opt_v3, "v3"}, + {Opt_udp, "proto=udp"}, {Opt_udp, "udp"}, + {Opt_tcp, "proto=tcp"}, {Opt_tcp, "tcp"}, {Opt_broken_suid, "broken_suid"}, {Opt_err, NULL} @@ -162,25 +171,21 @@ static match_table_t tokens = { static int __init root_nfs_parse(char *name, char *buf) { - char *p; + char *p, *path = name; substring_t args[MAX_OPT_ARGS]; int option; if (!name) return 1; - if (name[0] && strcmp(name, "default")){ - strlcpy(buf, name, NFS_MAXPATHLEN); - return 1; - } while ((p = strsep (&name, ",")) != NULL) { int token; if (!*p) continue; token = match_token(p, tokens, args); - /* %u tokens only */ - if (match_int(&args[0], &option)) + /* %u tokens only. 
Beware if you add new tokens! */ + if (token < Opt_soft && match_int(&args[0], &option)) return 0; switch (token) { case Opt_port: @@ -265,6 +270,13 @@ static int __init root_nfs_parse(char *n return 0; } } + + /* + * Copy the NFS remote path to the output buffer. + * Relies on strsep() having converted the delimiting ',' to '\0'. + */ + if (path[0] != '\0' && strcmp(path, "default") != 0) + strlcpy(buf, path, NFS_MAXPATHLEN); return 1; } @@ -283,9 +295,6 @@ static int __init root_nfs_name(char *na nfs_data.flags = NFS_MOUNT_NONLM; /* No lockd in nfs root yet */ nfs_data.rsize = NFS_DEF_FILE_IO_BUFFER_SIZE; nfs_data.wsize = NFS_DEF_FILE_IO_BUFFER_SIZE; - nfs_data.bsize = 0; - nfs_data.timeo = 7; - nfs_data.retrans = 3; nfs_data.acregmin = 3; nfs_data.acregmax = 60; nfs_data.acdirmin = 30; ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-19 16:39 ` Trond Myklebust @ 2004-04-19 21:10 ` Trond Myklebust 0 siblings, 0 replies; 54+ messages in thread From: Trond Myklebust @ 2004-04-19 21:10 UTC (permalink / raw) To: Russell King; +Cc: Marc Singer, linux-kernel [-- Attachment #1: Type: text/plain, Size: 417 bytes --] On Mon, 2004-04-19 at 12:39, Trond Myklebust wrote: > It turned out there were a few extra issues that weren't fixed by the > previous patch. Thanks to boris@macbeth.rhoen.de for helping debug them. > > Hopefully this will be the final set of fixes. Sigh. It wasn't... The remote path was still not getting set properly. Here's the final version. Tested, and it should now work according to spec! Cheers, Trond [-- Attachment #2: Type: text/plain, Size: 2718 bytes --] nfsroot.c | 30 +++++++++++++++++++----------- 1 files changed, 19 insertions(+), 11 deletions(-) diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c --- linux-2.6.6-01-soft/fs/nfs/nfsroot.c 2004-04-19 12:27:51.000000000 -0400 +++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c 2004-04-19 16:26:12.000000000 -0400 @@ -117,11 +117,16 @@ static int mount_port __initdata = 0; / ***************************************************************************/ enum { + /* Options that take integer arguments */ Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin, - Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr, + Opt_acregmax, Opt_acdirmin, Opt_acdirmax, + /* Options that take no arguments */ + Opt_soft, Opt_hard, Opt_intr, Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac, Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp, - Opt_broken_suid, Opt_err, + Opt_broken_suid, + /* Error token */ + Opt_err }; static match_table_t tokens = { @@ -146,9 +151,13 @@ static match_table_t tokens = { {Opt_noac, "noac"}, {Opt_lock, "lock"}, {Opt_nolock, "nolock"}, + {Opt_v2, 
"nfsvers=2"}, {Opt_v2, "v2"}, + {Opt_v3, "nfsvers=3"}, {Opt_v3, "v3"}, + {Opt_udp, "proto=udp"}, {Opt_udp, "udp"}, + {Opt_tcp, "proto=tcp"}, {Opt_tcp, "tcp"}, {Opt_broken_suid, "broken_suid"}, {Opt_err, NULL} @@ -169,18 +178,19 @@ static int __init root_nfs_parse(char *n if (!name) return 1; - if (name[0] && strcmp(name, "default")){ - strlcpy(buf, name, NFS_MAXPATHLEN); - return 1; - } + /* Set the NFS remote path */ + p = strsep(&name, ","); + if (p[0] != '\0' && strcmp(p, "default") != 0) + strlcpy(buf, p, NFS_MAXPATHLEN); + while ((p = strsep (&name, ",")) != NULL) { int token; if (!*p) continue; token = match_token(p, tokens, args); - /* %u tokens only */ - if (match_int(&args[0], &option)) + /* %u tokens only. Beware if you add new tokens! */ + if (token < Opt_soft && match_int(&args[0], &option)) return 0; switch (token) { case Opt_port: @@ -265,6 +275,7 @@ static int __init root_nfs_parse(char *n return 0; } } + return 1; } @@ -283,9 +294,6 @@ static int __init root_nfs_name(char *na nfs_data.flags = NFS_MOUNT_NONLM; /* No lockd in nfs root yet */ nfs_data.rsize = NFS_DEF_FILE_IO_BUFFER_SIZE; nfs_data.wsize = NFS_DEF_FILE_IO_BUFFER_SIZE; - nfs_data.bsize = 0; - nfs_data.timeo = 7; - nfs_data.retrans = 3; nfs_data.acregmin = 3; nfs_data.acregmax = 60; nfs_data.acdirmin = 30; ^ permalink raw reply [flat|nested] 54+ messages in thread
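For readers following the parser changes, here is a minimal user-space sketch of the logic the final patch arrives at: the first comma-separated field is the remote path, and only tokens that take an integer argument get their argument converted. It is an illustration, not the kernel code. The names next_field(), match_option(), parse_nfsroot(), struct parsed and the 64-byte path buffer are all hypothetical, and only a handful of options are covered.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of the nfsroot tokens. As in the patch, every option
 * that takes an integer argument is listed before OPT_SOFT. */
enum opt { OPT_PORT, OPT_RSIZE, OPT_WSIZE, OPT_SOFT, OPT_TCP, OPT_UDP, OPT_ERR };

struct parsed {
    char path[64];          /* illustrative size, not NFS_MAXPATHLEN */
    int port, rsize, wsize;
    int soft, tcp;
};

/* strsep()-like helper: returns the next ','-separated field and advances
 * *s, or returns NULL once the input is exhausted. */
static char *next_field(char **s)
{
    char *start = *s, *comma;

    if (start == NULL)
        return NULL;
    comma = strchr(start, ',');
    if (comma != NULL) {
        *comma = '\0';
        *s = comma + 1;
    } else {
        *s = NULL;
    }
    return start;
}

/* Identify one option. For integer options, *arg points at the text after
 * the '='; for flag options it is left NULL. */
static enum opt match_option(const char *p, const char **arg)
{
    static const struct { enum opt tok; const char *prefix; } int_opts[] = {
        { OPT_PORT, "port=" }, { OPT_RSIZE, "rsize=" }, { OPT_WSIZE, "wsize=" },
    };
    static const struct { enum opt tok; const char *name; } flag_opts[] = {
        { OPT_SOFT, "soft" },
        { OPT_TCP, "proto=tcp" }, { OPT_TCP, "tcp" },
        { OPT_UDP, "proto=udp" }, { OPT_UDP, "udp" },
    };
    size_t i;

    *arg = NULL;
    for (i = 0; i < sizeof(int_opts) / sizeof(int_opts[0]); i++) {
        size_t n = strlen(int_opts[i].prefix);

        if (strncmp(p, int_opts[i].prefix, n) == 0) {
            *arg = p + n;
            return int_opts[i].tok;
        }
    }
    for (i = 0; i < sizeof(flag_opts) / sizeof(flag_opts[0]); i++)
        if (strcmp(p, flag_opts[i].name) == 0)
            return flag_opts[i].tok;
    return OPT_ERR;
}

/* Parse "<root-dir>[,<nfs-options>]". Returns 1 on success, 0 on error. */
static int parse_nfsroot(char *name, struct parsed *out)
{
    char *p;

    assert(name != NULL && out != NULL);

    /* First field is the remote path; empty or "default" leaves it unset. */
    p = next_field(&name);
    if (p[0] != '\0' && strcmp(p, "default") != 0) {
        strncpy(out->path, p, sizeof(out->path) - 1);
        out->path[sizeof(out->path) - 1] = '\0';
    }

    while ((p = next_field(&name)) != NULL) {
        const char *arg;
        char *end;
        long value = 0;
        enum opt token;

        if (*p == '\0')
            continue;
        token = match_option(p, &arg);
        /* Only integer-taking tokens have an argument to convert. */
        if (token < OPT_SOFT) {
            value = strtol(arg, &end, 10);
            if (end == arg || *end != '\0')
                return 0;
        }
        switch (token) {
        case OPT_PORT:  out->port = (int)value;  break;
        case OPT_RSIZE: out->rsize = (int)value; break;
        case OPT_WSIZE: out->wsize = (int)value; break;
        case OPT_SOFT:  out->soft = 1;           break;
        case OPT_TCP:   out->tcp = 1;            break;
        case OPT_UDP:   out->tcp = 0;            break;
        default:        return 0;
        }
    }
    return 1;
}
```

With this shape, an option string whose first option is "tcp" no longer reaches the integer conversion at all, which is the failure Russell pointed out. The ordering dependency on the enum remains, as the patch comment warns.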
* Re: NFS and kernel 2.6.x 2004-04-17 18:58 ` Trond Myklebust 2004-04-17 19:01 ` Marc Singer @ 2004-04-17 22:22 ` Marc Singer 2004-04-18 0:57 ` Trond Myklebust 1 sibling, 1 reply; 54+ messages in thread From: Marc Singer @ 2004-04-17 22:22 UTC (permalink / raw) To: Trond Myklebust; +Cc: Marc Singer, linux-kernel On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote: > > I'd be glad to compare TCP to UDP on my system. It's using an nfsroot > > mount. It looks like the support is there. What activates it? > > It's all there. Just use the "tcp" mount option. > I have a data point for comparison. I'm copying a 40MiB file over NFS. In five trials, the mean transfer times are UDP (v2): 48.5s TCP (v3): 52.7s ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-17 22:22 ` Marc Singer @ 2004-04-18 0:57 ` Trond Myklebust 2004-04-18 5:01 ` Marc Singer 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-18 0:57 UTC (permalink / raw) To: Marc Singer; +Cc: linux-kernel On Sat, 2004-04-17 at 15:22, Marc Singer wrote: > I have a data point for comparison. > > I'm copying a 40MiB file over NFS. In five trials, the mean transfer > times are > > UDP (v2): 48.5s > TCP (v3): 52.7s Against what kind of server on what kind of network, with what kind of mount options? The above would be quite reasonable performance on a 10Mbit network against a filer or a Linux server with the (insecure) "async" option set. Cheers, Trond ^ permalink raw reply [flat|nested] 54+ messages in thread
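To put the two timings on the same scale as the link speed, the arithmetic can be sketched as below. This is only a unit conversion under stated assumptions: 1 MiB is taken as 1048576 bytes and 1 Mbit as 1000000 bits, and the function name throughput_mbit_per_s() is made up for this illustration.

```c
#include <assert.h>

/* Mean throughput in Mbit/s for a transfer of `mib` MiB that took
 * `seconds` seconds (1 MiB = 1048576 bytes, 1 Mbit = 1000000 bits). */
static double throughput_mbit_per_s(double mib, double seconds)
{
    assert(seconds > 0.0);
    return mib * 1048576.0 * 8.0 / 1000000.0 / seconds;
}
```

Under those assumptions, 40 MiB in 48.5 s comes to roughly 6.9 Mbit/s and 40 MiB in 52.7 s to roughly 6.4 Mbit/s, which is why the figures look plausible for a 10Mbit network.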
* Re: NFS and kernel 2.6.x 2004-04-18 0:57 ` Trond Myklebust @ 2004-04-18 5:01 ` Marc Singer 2004-04-18 6:36 ` Chris Friesen 0 siblings, 1 reply; 54+ messages in thread From: Marc Singer @ 2004-04-18 5:01 UTC (permalink / raw) To: Trond Myklebust; +Cc: Marc Singer, linux-kernel On Sat, Apr 17, 2004 at 05:57:46PM -0700, Trond Myklebust wrote: > On Sat, 2004-04-17 at 15:22, Marc Singer wrote: > > I have a data point for comparison. > > > > I'm copying a 40MiB file over NFS. In five trials, the mean transfer > > times are > > > > UDP (v2): 48.5s > > TCP (v3): 52.7s > > Against what kind of server on what kind of network, with what kind of > mount options? > The above would be quite reasonable performance on a 10Mbit network > against a filer or a Linux server with the (insecure) "async" option > set. Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the kernel nfs daemon; network is 100Mib. There is nothing else on the network except intermittent broadband traffic. Async is set on the server side. While I have seen much worse performance in the last couple of weeks, I cannot blame NFS when I look at the numbers. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x
2004-04-18 5:01 ` Marc Singer
@ 2004-04-18 6:36 ` Chris Friesen
2004-04-18 7:56 ` Russell King
0 siblings, 1 reply; 54+ messages in thread
From: Chris Friesen @ 2004-04-18 6:36 UTC (permalink / raw)
To: Marc Singer; +Cc: Trond Myklebust, linux-kernel

Marc Singer wrote:

> Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> kernel nfs daemon; network is 100Mib. There is nothing else on the
> network except intermittent broadband traffic. Async is set on the
> server side.

Is the ARM that slow? Under 2MB/s seems odd to me... but then maybe I'm
used to faster machines.

Chris

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-18 6:36 ` Chris Friesen @ 2004-04-18 7:56 ` Russell King 2004-04-18 17:31 ` Marc Singer 0 siblings, 1 reply; 54+ messages in thread From: Russell King @ 2004-04-18 7:56 UTC (permalink / raw) To: Chris Friesen; +Cc: Marc Singer, Trond Myklebust, linux-kernel On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote: > Marc Singer wrote: > > > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the > > kernel nfs daemon; network is 100Mib. There is nothing else on the > > network except intermittent broadband traffic. Async is set on the > > server side. > > Is the ARM that slow? under 2MB/s seems odd to me...but them maybe I'm > used to faster machines. It's probably the SMC91c111 ether chip causing all the problem - it's only able to store about 4 packets before it starts dropping, which isn't that much on a 100mbit network. Running with rsize=4096 works wonders with this chip. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/ 2.6 Serial core ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-18 7:56 ` Russell King @ 2004-04-18 17:31 ` Marc Singer 0 siblings, 0 replies; 54+ messages in thread From: Marc Singer @ 2004-04-18 17:31 UTC (permalink / raw) To: Chris Friesen, Marc Singer, Trond Myklebust, linux-kernel On Sun, Apr 18, 2004 at 08:56:19AM +0100, Russell King wrote: > On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote: > > Marc Singer wrote: > > > > > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the > > > kernel nfs daemon; network is 100Mib. There is nothing else on the > > > network except intermittent broadband traffic. Async is set on the > > > server side. > > > > Is the ARM that slow? under 2MB/s seems odd to me...but them maybe I'm > > used to faster machines. > > It's probably the SMC91c111 ether chip causing all the problem - it's > only able to store about 4 packets before it starts dropping, which > isn't that much on a 100mbit network. I suspect that it might be a CPU issue. On transmit only, it never gets above 18Mib. > Running with rsize=4096 works wonders with this chip. Already there. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-17 18:32 ` Marc Singer 2004-04-17 18:58 ` Trond Myklebust @ 2004-04-17 19:01 ` Daniel Egger 2004-04-17 20:22 ` Marc Singer 1 sibling, 1 reply; 54+ messages in thread From: Daniel Egger @ 2004-04-17 19:01 UTC (permalink / raw) To: Marc Singer; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 433 bytes --] On 17.04.2004, at 20:32, Marc Singer wrote: > I'd be glad to compare TCP to UDP on my system. It's using an nfsroot > mount. It looks like the support is there. What activates it? You need to add at least tcp as parameter to the nfsroot boot option, like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 . And, of course, if you mount/remount NFS partitions you also need to specify the tcp parameter in your fstab. Servus, Daniel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 478 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-17 19:01 ` Daniel Egger @ 2004-04-17 20:22 ` Marc Singer 2004-04-18 11:14 ` Daniel Egger 0 siblings, 1 reply; 54+ messages in thread From: Marc Singer @ 2004-04-17 20:22 UTC (permalink / raw) To: Daniel Egger; +Cc: Marc Singer, linux-kernel On Sat, Apr 17, 2004 at 09:01:38PM +0200, Daniel Egger wrote: > On 17.04.2004, at 20:32, Marc Singer wrote: > > >I'd be glad to compare TCP to UDP on my system. It's using an nfsroot > >mount. It looks like the support is there. What activates it? > > You need to add at least tcp as parameter to the nfsroot boot option, > like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 . What I'd like to do is use a command line like this root=/dev/nfs ip=rarp nfsroot=,tcp,v3 But, it doesn't work. I'd like to let the kernel autoconfiguration handle the addressing. > And, of course, if you mount/remount NFS partitions you also need to > specify the tcp parameter in your fstab. > > Servus, > Daniel ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x
2004-04-17 20:22 ` Marc Singer
@ 2004-04-18 11:14 ` Daniel Egger
0 siblings, 0 replies; 54+ messages in thread
From: Daniel Egger @ 2004-04-18 11:14 UTC (permalink / raw)
To: Marc Singer; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 665 bytes --]

On 17.04.2004, at 22:22, Marc Singer wrote:

> What I'd like to do is use a command line like this
>
> root=/dev/nfs ip=rarp nfsroot=,tcp,v3
>
> But, it doesn't work. I'd like to let the kernel autoconfiguration
> handle the addressing.

According to Documentation/nfsroot.txt you should be able to do:

root=/dev/nfs ip=rarp nfsroot=/kernel,tcp,v3

i.e. the ip is optional.

Just out of curiosity: How would you supply the kernel name using
rarp/bootp/dhcp? I've been using pxelinux for a few days, but before
that I needed to hardcode the path into the tagged image. Actually I
prefer this to restarting the dhcp server, but...

Servus,
Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 478 bytes --]

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-17 16:44 ` Matthias Urlichs 2004-04-17 18:15 ` Trond Myklebust @ 2004-04-19 9:06 ` Helge Hafting 1 sibling, 0 replies; 54+ messages in thread From: Helge Hafting @ 2004-04-19 9:06 UTC (permalink / raw) To: Matthias Urlichs, linux-kernel Matthias Urlichs wrote: > Hi, Trond Myklebust wrote: > > >>As for blanket statements like the above: I have seen no evidence yet >>that they are any more warranted in 2.6.x than they were in 2.4.x. > > > Oh, I saw the problem too: a slow client couldn't do full-size reads from > a fast server because the buffer on the client's network card was just 8k. > You can force nfs to use smaller packets, useful for those who have to use udp because the server doesn't support nfs over tcp. Try 8k, or even 4k. Helge Hafting ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x
2004-04-16 1:53 ` Andrew Morton
2004-04-16 2:54 ` Trond Myklebust
@ 2004-04-16 9:03 ` Jamie Lokier
2004-04-16 15:55 ` Trond Myklebust
1 sibling, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-16 9:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: Trond Myklebust, shannon, linux-kernel

Andrew Morton wrote:
> > On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> But Charles was seeing good performance with 2.4-based clients. When he
> went to 2.6 everything fell apart.

Perhaps because 2.6 changes the UDP retransmit model for NFS, to
estimate the round-trip time and thus retransmit faster than 2.4
would. Sometimes _much_ faster: I observed retransmits within a few
hundred microseconds.

On networks with a lot of latency variance, i.e. anything with big
queues, that would increase congestion. That'd increase losses, and
because NFS over UDP uses large fragmented IP frames (TCP doesn't),
fragment loss will greatly increase IP frame loss, as Trond explained.

That's my hypothesis.

There was also a problem with late 2.5 clients and "soft" NFS mounts.
Requests would timeout after a fixed number of retransmits, which on a
LAN could be after a few milliseconds due to round-trip estimation and
fast server response. Then when an I/O on the server took longer,
e.g. due to a disk seek or contention, the client would timeout and
abort requests. 2.4 doesn't have this problem with "soft" due to the
longer, fixed retransmit timeout. I don't know if it is fixed in
current 2.6 kernels - but you can avoid it by not using "soft" anyway.

-- Jamie

^ permalink raw reply [flat|nested] 54+ messages in thread
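The point about fragment loss amplifying datagram loss can be made concrete with a small sketch. It assumes IPv4 with a 20-byte header and no options, an 8-byte UDP header, independent per-fragment loss, and it ignores RPC/NFS header overhead. The function names udp_fragments() and datagram_loss() are hypothetical.

```c
#include <assert.h>

/* Number of IPv4 fragments needed for a UDP datagram carrying
 * payload_bytes of data over a link with the given MTU. */
static unsigned int udp_fragments(unsigned int payload_bytes, unsigned int mtu)
{
    unsigned int per_fragment, total;

    assert(mtu > 28);
    /* Fragment data is carried in multiples of 8 bytes (except the last). */
    per_fragment = ((mtu - 20) / 8) * 8;
    total = payload_bytes + 8;      /* the UDP header rides in the datagram */
    return (total + per_fragment - 1) / per_fragment;
}

/* Probability that the whole datagram is lost when each fragment is
 * dropped independently with probability p: 1 - (1 - p)^fragments. */
static double datagram_loss(double p, unsigned int fragments)
{
    double survive = 1.0;
    unsigned int i;

    for (i = 0; i < fragments; i++)
        survive *= 1.0 - p;
    return 1.0 - survive;
}
```

On a 1500-byte MTU this gives 6 fragments for an 8192-byte payload and 3 for 4096 bytes. With a 1% per-fragment loss rate the 6-fragment datagram is lost a little under 6% of the time, since one missing fragment discards the whole datagram.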
* Re: NFS and kernel 2.6.x 2004-04-16 9:03 ` Jamie Lokier @ 2004-04-16 15:55 ` Trond Myklebust 2004-04-16 18:48 ` Jamie Lokier 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-16 15:55 UTC (permalink / raw) To: Jamie Lokier; +Cc: Andrew Morton, shannon, linux-kernel On Fri, 2004-04-16 at 02:03, Jamie Lokier wrote: > Perhaps because 2.6 changes the UDP retransmit model for NFS, to > estimate the round-trip time and thus retransmit faster than 2.4 > would. Sometimes _much_ faster: I observed retransmits within a few > hundred microseconds. Retransmits within a few 100 microsecond should no longer be occurring. Have you redone those measurements with a more recent kernel? 2.6.x and 2.4.x should have pretty much the same code for RTO estimation. In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one difference is that 2.6.x uses zero copy writes. > There was also a problem with late 2.5 clients and "soft" NFS mounts. > Requests would timeout after a fixed number of retransmits, which on a > LAN could be after a few milliseconds due to round-trip estimation and > fast server response. Then when an I/O on the server took longer, > e.g. due to a disk seek or contention, the client would timeout and > abort requests. 2.4 doesn't have this problem with "soft" due to the > longer, fixed retransmit timeout. I don't know if it is fixed in > current 2.6 kernels - but you can avoid it by not using "soft" anyway. Or changing the default value of "retrans" to something more sane. As usual, Linux has a default that is lower than on any other platform. Cheers, Trond ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-16 15:55 ` Trond Myklebust @ 2004-04-16 18:48 ` Jamie Lokier 2004-04-16 19:06 ` Trond Myklebust 0 siblings, 1 reply; 54+ messages in thread From: Jamie Lokier @ 2004-04-16 18:48 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel Trond Myklebust wrote: > > Perhaps because 2.6 changes the UDP retransmit model for NFS, to > > estimate the round-trip time and thus retransmit faster than 2.4 > > would. Sometimes _much_ faster: I observed retransmits within a few > > hundred microseconds. > > Retransmits within a few 100 microsecond should no longer be occurring. > Have you redone those measurements with a more recent kernel? No, not since I sent you the packet trace from a 2.5 kernel that wasn't working with "soft". I took your advice and stopped using "soft". It causes the obvious problem when I (rarely) turn off the server, otherwise it's been fine and I'm using 2.6.5 now, still fine (with "soft" not being used). > 2.6.x and 2.4.x should have pretty much the same code for RTO > estimation. > > In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one > difference is that 2.6.x uses zero copy writes. > > > There was also a problem with late 2.5 clients and "soft" NFS mounts. > > Requests would timeout after a fixed number of retransmits, which on a > > LAN could be after a few milliseconds due to round-trip estimation and > > fast server response. Then when an I/O on the server took longer, > > e.g. due to a disk seek or contention, the client would timeout and > > abort requests. 2.4 doesn't have this problem with "soft" due to the > > longer, fixed retransmit timeout. I don't know if it is fixed in > > current 2.6 kernels - but you can avoid it by not using "soft" anyway. > > Or changing the default value of "retrans" to something more sane. As > usual, Linux has a default that is lower than on any other platform. 
If few-100-microsecond retransmits no longer occur, perhaps it's no longer relevant. The problem I saw with "soft" was that the retransmit time was quite a good estimate of the server response time. That part was fine, nice even. But then the server response latency would increase by a factor of 10000 (ten thousand) due to normal disk I/O activity (compare cache response with disk response on a busy disk), and of course 3 retransmits doubling each time is not adequate to cover that. 2.4 was fine because the default rtt and retrans together could never get shorter than a few seconds. That's why I felt that iff rtt was adapting to the server response time, then a fixed number of retransmits was no longer appropriate: a lower bound on the time before timing out is appropriate, e.g. 3 seconds or 10 seconds or whatever. In other words, with adaptive rtt the concept of "retrans" being a fixed number is fundamentally flawed -- unless it's also accompanied by a minimum timeout time. You'd need a retrans value of 20 or so for the above perfectly normal LAN situation, but then that's far too large on other occasions with other networks or servers. -- Jamie ^ permalink raw reply [flat|nested] 54+ messages in thread
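The mismatch between a fixed retransmit count and an adaptive initial timeout can be illustrated with a back-of-the-envelope sketch. The model is an assumption made for this example only: one wait of the initial timeout, then each retransmission doubles the wait, with no upper clamp. The names total_wait() and retrans_needed() are hypothetical.

```c
#include <assert.h>

/* Total time spent waiting before the request is abandoned, assuming the
 * first wait is `initial` and each of the `retrans` retransmissions
 * doubles the previous wait. Units are whatever `initial` is in. */
static unsigned long long total_wait(unsigned long long initial,
                                     unsigned int retrans)
{
    unsigned long long total = 0, wait = initial;
    unsigned int i;

    assert(retrans < 32);
    for (i = 0; i <= retrans; i++) {
        total += wait;
        wait <<= 1;
    }
    return total;
}

/* Smallest retrans value for which the total wait reaches at least
 * `factor` times the initial timeout (capped at 31 in this sketch). */
static unsigned int retrans_needed(unsigned long long factor)
{
    unsigned int n = 0;

    while (n < 31 && total_wait(1, n) < factor)
        n++;
    return n;
}
```

In this model, an initial timeout of 300 microseconds with retrans=3 gives up after 4500 microseconds, far short of a disk-bound reply. Covering a 10000-fold latency increase needs a retrans in the low teens here; that is smaller than the "20 or so" quoted above, so treat the sketch as an order-of-magnitude illustration rather than a reproduction of that estimate.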
* Re: NFS and kernel 2.6.x 2004-04-16 18:48 ` Jamie Lokier @ 2004-04-16 19:06 ` Trond Myklebust 2004-04-16 19:39 ` Jamie Lokier 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-16 19:06 UTC (permalink / raw) To: Jamie Lokier; +Cc: Andrew Morton, shannon, linux-kernel On Fri, 2004-04-16 at 11:48, Jamie Lokier wrote: > In other words, with adaptive rtt the concept of "retrans" being a > fixed number is fundamentally flawed -- unless it's also accompanied > by a minimum timeout time. You'd need a retrans value of 20 or so for > the above perfectly normal LAN situation, but then that's far too > large on other occasions with other networks or servers. At that point, it makes sense to drop the entire "retrans+timeo" paradigm, and just state that soft timeouts take a single parameter ("timeo") that determines the timeout value. That's something that is dead easy to do... Cheers, Trond ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-16 19:06 ` Trond Myklebust @ 2004-04-16 19:39 ` Jamie Lokier 2004-04-17 22:32 ` Trond Myklebust 0 siblings, 1 reply; 54+ messages in thread From: Jamie Lokier @ 2004-04-16 19:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, shannon, linux-kernel Trond Myklebust wrote: > > In other words, with adaptive rtt the concept of "retrans" being a > > fixed number is fundamentally flawed -- unless it's also accompanied > > by a minimum timeout time. You'd need a retrans value of 20 or so for > > the above perfectly normal LAN situation, but then that's far too > > large on other occasions with other networks or servers. > > At that point, it makes sense to drop the entire "retrans+timeo" > paradigm, and just state that soft timeouts take a single parameter > ("timeo") that determines the timeout value. I agree. 30 seconds seems like a good default. > That's something that is dead easy to do... I'll test a patch for 2.6.5 if you provide one. -- Jamie ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: NFS and kernel 2.6.x 2004-04-16 19:39 ` Jamie Lokier @ 2004-04-17 22:32 ` Trond Myklebust 2004-04-18 3:26 ` Jamie Lokier 0 siblings, 1 reply; 54+ messages in thread From: Trond Myklebust @ 2004-04-17 22:32 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 752 bytes --] On Fri, 2004-04-16 at 12:39, Jamie Lokier wrote: > > That's something that is dead easy to do... > > I'll test a patch for 2.6.5 if you provide one. Here you go... With this patch - the major timeout is of fixed length "timeo<<retrans", and the clock starts at the first attempt to send the packet. - If a major timeout occurs, we now reset the RTT estimator so as to "slow start" when the server becomes available again. For the moment it does use the timeo + retrans values, because the former is in fact wanted in order to initialize the RTT estimator. However, it no longer uses the count of the number of actual retransmissions in order to determine whether or not a major timeout occurred. 
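The fixed-length major timeout described above ("timeo<<retrans") can be sketched in user space as follows. This mirrors the shape of xprt_reset_majortimeo() in the attached patch but is not the kernel code: struct timeout_params and major_timeout() are hypothetical names, the sketch starts from the initial timeout rather than the request's current timeout, and it returns a window length instead of adding jiffies to form a deadline.

```c
#include <assert.h>

/* User-space stand-in for the struct rpc_timeout fields used here. */
struct timeout_params {
    unsigned long initval;      /* initial timeout ("timeo"), in ticks */
    unsigned long maxval;       /* upper bound on the timeout, in ticks */
    unsigned long increment;    /* per-retry increment if !exponential */
    unsigned int retries;       /* "retrans" */
    int exponential;            /* nonzero: double on each retry */
};

/* Length of the major timeout window, measured from the first attempt
 * to send the request. */
static unsigned long major_timeout(const struct timeout_params *to)
{
    unsigned long t = to->initval;

    assert(to->retries < 8 * sizeof(t));
    if (to->exponential)
        t <<= to->retries;
    else
        t += to->increment * to->retries;
    /* Clamp to the maximum, and never allow a zero-length window. */
    if (t > to->maxval || t == 0)
        t = to->maxval;
    return t;
}
```

Because the window depends only on the configured values and the time of the first transmission, a fast RTT estimate can shorten individual retransmit intervals without shortening the time before a soft mount gives up.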
Cheers, Trond [-- Attachment #2: linux-2.6.6-01-soft.dif --] [-- Type: text/plain, Size: 9304 bytes --] include/linux/sunrpc/xprt.h | 10 ++-- net/sunrpc/auth_gss/auth_gss.c | 2 net/sunrpc/clnt.c | 4 - net/sunrpc/timer.c | 1 net/sunrpc/xprt.c | 91 +++++++++++++++++++++++++---------------- 5 files changed, 63 insertions(+), 45 deletions(-) diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/include/linux/sunrpc/xprt.h linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h --- linux-2.6.6-pre1/include/linux/sunrpc/xprt.h 2004-04-17 11:05:10.000000000 -0700 +++ linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h 2004-04-17 13:55:40.000000000 -0700 @@ -69,8 +69,7 @@ extern unsigned int xprt_tcp_slot_table_ * This describes a timeout strategy */ struct rpc_timeout { - unsigned long to_current, /* current timeout */ - to_initval, /* initial timeout */ + unsigned long to_initval, /* initial timeout */ to_maxval, /* max timeout */ to_increment; /* if !exponential */ unsigned int to_retries; /* max # of retries */ @@ -85,7 +84,6 @@ struct rpc_rqst { * This is the user-visible part */ struct rpc_xprt * rq_xprt; /* RPC client */ - struct rpc_timeout rq_timeout; /* timeout parms */ struct xdr_buf rq_snd_buf; /* send buffer */ struct xdr_buf rq_rcv_buf; /* recv buffer */ @@ -103,6 +101,9 @@ struct rpc_rqst { struct xdr_buf rq_private_buf; /* The receive buffer * used in the softirq. */ + unsigned long rq_majortimeo; /* major timeout alarm */ + unsigned long rq_timeout; /* Current timeout value */ + unsigned int rq_retries; /* # of retries */ /* * For authentication (e.g. 
auth_des)
 	 */
@@ -115,7 +116,6 @@ struct rpc_rqst {
 	u32			rq_bytes_sent;	/* Bytes we have sent */

 	unsigned long		rq_xtime;	/* when transmitted */
-	int			rq_ntimeo;
 	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
@@ -210,7 +210,7 @@ void			xprt_reserve(struct rpc_task *);
 int			xprt_prepare_transmit(struct rpc_task *);
 void			xprt_transmit(struct rpc_task *);
 void			xprt_receive(struct rpc_task *);
-int			xprt_adjust_timeout(struct rpc_timeout *);
+int			xprt_adjust_timeout(struct rpc_rqst *req);
 void			xprt_release(struct rpc_task *);
 void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/auth_gss/auth_gss.c linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c
--- linux-2.6.6-pre1/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 11:04:59.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 14:31:29.000000000 -0700
@@ -736,10 +736,8 @@ static int
 gss_refresh(struct rpc_task *task)
 {
 	struct rpc_clnt *clnt = task->tk_client;
-	struct rpc_xprt *xprt = task->tk_xprt;
 	struct rpc_cred *cred = task->tk_msg.rpc_cred;

-	task->tk_timeout = xprt->timeout.to_current;
 	if (!gss_cred_is_uptodate_ctx(cred))
 		return gss_upcall(clnt, task, cred);
 	return 0;
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/clnt.c linux-2.6.6-01-soft/net/sunrpc/clnt.c
--- linux-2.6.6-pre1/net/sunrpc/clnt.c	2004-04-17 11:04:57.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/clnt.c	2004-04-17 15:05:14.000000000 -0700
@@ -788,13 +788,11 @@ static void
 call_timeout(struct rpc_task *task)
 {
 	struct rpc_clnt	*clnt = task->tk_client;
-	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;

-	if (xprt_adjust_timeout(to)) {
+	if (xprt_adjust_timeout(task->tk_rqstp) == 0) {
 		dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
 		goto retry;
 	}
-	to->to_retries = clnt->cl_timeout.to_retries;

 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (RPC_IS_SOFT(task)) {
diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/timer.c linux-2.6.6-01-soft/net/sunrpc/timer.c
--- linux-2.6.6-pre1/net/sunrpc/timer.c	2004-04-17 11:05:23.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/timer.c	2004-04-17 15:02:33.000000000 -0700
@@ -39,6 +39,7 @@ rpc_init_rtt(struct rpc_rtt *rt, unsigne
 	for (i = 0; i < 5; i++) {
 		rt->srtt[i] = init;
 		rt->sdrtt[i] = RPC_RTO_INIT;
+		rt->ntimeouts[i] = 0;
 	}
 }

diff -u --recursive --new-file --show-c-function linux-2.6.6-pre1/net/sunrpc/xprt.c linux-2.6.6-01-soft/net/sunrpc/xprt.c
--- linux-2.6.6-pre1/net/sunrpc/xprt.c	2004-04-17 11:05:09.000000000 -0700
+++ linux-2.6.6-01-soft/net/sunrpc/xprt.c	2004-04-17 15:21:56.000000000 -0700
@@ -352,35 +352,59 @@ xprt_adjust_cwnd(struct rpc_xprt *xprt, 
 }

 /*
+ * Reset the major timeout value
+ */
+static void xprt_reset_majortimeo(struct rpc_rqst *req)
+{
+	struct rpc_timeout *to = &req->rq_xprt->timeout;
+
+	req->rq_majortimeo = req->rq_timeout;
+	if (to->to_exponential)
+		req->rq_majortimeo <<= to->to_retries;
+	else
+		req->rq_majortimeo += to->to_increment * to->to_retries;
+	if (req->rq_majortimeo > to->to_maxval || req->rq_majortimeo == 0)
+		req->rq_majortimeo = to->to_maxval;
+	req->rq_majortimeo += jiffies;
+}
+
+/*
  * Adjust timeout values etc for next retransmit
  */
-int
-xprt_adjust_timeout(struct rpc_timeout *to)
+int xprt_adjust_timeout(struct rpc_rqst *req)
 {
-	if (to->to_retries > 0) {
-		if (to->to_exponential)
-			to->to_current <<= 1;
-		else
-			to->to_current += to->to_increment;
-		if (to->to_maxval && to->to_current >= to->to_maxval)
-			to->to_current = to->to_maxval;
+	struct rpc_xprt *xprt = req->rq_xprt;
+	struct rpc_timeout *to = &xprt->timeout;
+	int status = 0;
+
+	if (time_before(jiffies, req->rq_majortimeo)) {
+		if (req->rq_retries < to->to_retries) {
+			if (to->to_exponential)
+				req->rq_timeout <<= 1;
+			else
+				req->rq_timeout += to->to_increment;
+			if (to->to_maxval && req->rq_timeout >= to->to_maxval)
+				req->rq_timeout = to->to_maxval;
+			req->rq_retries++;
+		}
+		pprintk("RPC: %lu retrans\n", jiffies);
 	} else {
-		if (to->to_exponential)
-			to->to_initval <<= 1;
-		else
-			to->to_initval += to->to_increment;
-		if (to->to_maxval && to->to_initval >= to->to_maxval)
-			to->to_initval = to->to_maxval;
-		to->to_current = to->to_initval;
-	}
-
-	if (!to->to_current) {
-		printk(KERN_WARNING "xprt_adjust_timeout: to_current = 0!\n");
-		to->to_current = 5 * HZ;
-	}
-	pprintk("RPC: %lu %s\n", jiffies,
-			to->to_retries? "retrans" : "timeout");
-	return to->to_retries-- > 0;
+		req->rq_timeout = to->to_initval;
+		req->rq_retries = 0;
+		xprt_reset_majortimeo(req);
+		/* Reset the RTT counters == "slow start" */
+		spin_lock_bh(&xprt->sock_lock);
+		rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval);
+		spin_unlock_bh(&xprt->sock_lock);
+		pprintk("RPC: %lu timeout\n", jiffies);
+		status = -ETIMEDOUT;
+	}
+
+	if (req->rq_timeout == 0) {
+		printk(KERN_WARNING "xprt_adjust_timeout: rq_timeout = 0!\n");
+		req->rq_timeout = 5 * HZ;
+	}
+	return status;
 }

 /*
@@ -1166,6 +1190,7 @@ xprt_transmit(struct rpc_task *task)
 			/* Add request to the receive list */
 			list_add_tail(&req->rq_list, &xprt->recv);
 			spin_unlock_bh(&xprt->sock_lock);
+			xprt_reset_majortimeo(req);
 		}
 	} else if (!req->rq_bytes_sent)
 		return;
@@ -1221,7 +1246,7 @@ xprt_transmit(struct rpc_task *task)
 		if (!xprt_connected(xprt))
 			task->tk_status = -ENOTCONN;
 		else if (test_bit(SOCK_NOSPACE, &xprt->sock->flags)) {
-			task->tk_timeout = req->rq_timeout.to_current;
+			task->tk_timeout = req->rq_timeout;
 			rpc_sleep_on(&xprt->pending, task, NULL, NULL);
 		}
 		spin_unlock_bh(&xprt->sock_lock);
@@ -1248,13 +1273,11 @@ xprt_transmit(struct rpc_task *task)
 	if (!xprt->nocong) {
 		int timer = task->tk_msg.rpc_proc->p_timer;
 		task->tk_timeout = rpc_calc_rto(clnt->cl_rtt, timer);
-		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer);
-		task->tk_timeout <<= clnt->cl_timeout.to_retries
-			- req->rq_timeout.to_retries;
-		if (task->tk_timeout > req->rq_timeout.to_maxval)
-			task->tk_timeout = req->rq_timeout.to_maxval;
+		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer) + req->rq_retries;
+		if (task->tk_timeout > xprt->timeout.to_maxval || task->tk_timeout == 0)
+			task->tk_timeout = xprt->timeout.to_maxval;
 	} else
-		task->tk_timeout = req->rq_timeout.to_current;
+		task->tk_timeout = req->rq_timeout;
 	/* Don't race with disconnect */
 	if (!xprt_connected(xprt))
 		task->tk_status = -ENOTCONN;
@@ -1324,7 +1347,7 @@ xprt_request_init(struct rpc_task *task,
 {
 	struct rpc_rqst	*req = task->tk_rqstp;

-	req->rq_timeout = xprt->timeout;
+	req->rq_timeout = xprt->timeout.to_initval;
 	req->rq_task	= task;
 	req->rq_xprt    = xprt;
 	req->rq_xid     = xprt_alloc_xid(xprt);
@@ -1381,7 +1404,6 @@ xprt_default_timeout(struct rpc_timeout 
 void
 xprt_set_timeout(struct rpc_timeout *to, unsigned int retr, unsigned long incr)
 {
-	to->to_current   =
 	to->to_initval   =
 	to->to_increment = incr;
 	to->to_maxval    = incr * retr;
@@ -1446,7 +1468,6 @@ xprt_setup(int proto, struct sockaddr_in
 	/* Set timeout parameters */
 	if (to) {
 		xprt->timeout = *to;
-		xprt->timeout.to_current = to->to_initval;
 	} else
 		xprt_default_timeout(&xprt->timeout, xprt->prot);

* Re: NFS and kernel 2.6.x
2004-04-17 22:32 ` Trond Myklebust
@ 2004-04-18 3:26 ` Jamie Lokier
2004-04-18 7:03 ` Trond Myklebust
0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-18 3:26 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-kernel

Trond Myklebust wrote:
> With this patch
>   - the major timeout is of fixed length "timeo<<retrans", and the
>     clock starts at the first attempt to send the packet.
>   - If a major timeout occurs, we now reset the RTT estimator so
>     as to "slow start" when the server becomes available again.
>
> For the moment it does use the timeo + retrans values, because the
> former is in fact wanted in order to initialize the RTT estimator.
> However, it no longer uses the count of the number of actual
> retransmissions in order to determine whether or not a major timeout
> occurred.

Ok, observations:

- The RTT converges to 0.1s on my LAN, just as it did before the
  patch. Very sensible, and as you said the 100 microsecond problem
  is not with us these days.

- The RTT is reset after a timeout (from 0.1-0.15s to 0.7s in my
  tests). As expected.

- With the defaults (retrans=3, timeo=0.7s), I see:

  After disconnecting the server, the client first times out after
  about 5.5-6 seconds. First minor timeout is 0.1. This makes sense
  as 0.7 << 3 == 5.6.

  Subsequent timeouts take about 10.5 seconds. This also makes sense,
  as you have set the timeout threshold at 0.7*8 == 5.6 seconds, and
  three timeouts is 0.7*(1+2+4) == 4.9 seconds, too short. Four
  timeouts is 0.7*(1+2+4+8) == 10.5 seconds.

  The old behaviour before RTT estimation would have timed out after
  10.5 seconds, I think.

- With retrans=5, and timeo still at the default value of 0.7s:

  After disconnecting the server, the minor timeout intervals are
  approximately: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 3.2, 3.2, 3.2, 3.2,
  3.2 seconds.

  Are they intended to stop doubling at 3.2? The major timeout
  thus happens after 22.3 seconds.

  Unsurprisingly, subsequent major timeouts take 44.1 seconds.

So this patch is a big improvement, and I'm going to keep using it for
my home directory with retrans=5,soft so it gets some more background
testing. (retrans=3 is too short even with the patch).

However, there are potential improvements. One is that the 3.2 above
should continue doubling.

The other is that behaviour would be nicer if the major timeout time
was more predictable: 22.3 to 44.1 seconds is a big range. This is
easy with the algorithm described below.

It isn't possible to remove the variation completely. However, it can
easily be reduced by changing the doubling strategy: keep doubling the
retransmit time, until it exceeds timeo. When that happens, set the
retransmit time to the next greater or equal value of timeo << N for
some integer N.

For example, with RTT at 0.1s, retrans=5, timeo=0.7, these would be
the minor timeout intervals:

    0.1, 0.2, 0.4, 0.7, 1.4, 2.8, 5.6, 11.2, 22.4

leading to a total major timeout time of 44.8 seconds.

Subsequent major timeouts, with the RTT reset to 0.7s, would take 44.1
seconds:

    0.7, 1.4, 2.8, 5.6, 11.2, 22.4.

If the RTT estimator is larger than timeo to start with, the first
retransmit will timeout after RTT, but subsequent ones will be a value
of timeo << N. E.g. if RTT was 2s, this would be the minor timeout
sequence:

    2.0, 2.8, 5.6, 11.2, 22.4.

The algorithm for deciding when a major timeout occurs is different
too. Instead of keeping track of the total time since the very first
transmission, you simply deem the major timeout to occur after the
minor timeout of timeo << retrans occurs. I.e. in these examples, the
22.4s minor timeout is always the final one.

This reduces the possible variation, with these parameters, to the
range 44.1 to 45.325 seconds: much more consistent than 22.05 to 44.1
seconds.

As well as giving more consistent results, this might even be simpler
than the algorithm in your patch, because there is no need to remember
the total time since the first transmission.

-- Jamie
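The 10.5 and 44.1 second figures observed with the patch can be checked with a small model. The C sketch below is not kernel code: it assumes that every minor timeout simply doubles from a starting value, that times are in milliseconds, and that a major timeout is declared at the first minor-timeout expiry falling at or after the fixed deadline timeo << retrans. The function name major_timeout_ms is made up for illustration, and the RTT estimator and the to_maxval cap are ignored.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not the kernel code): minor
 * timeouts double starting from first_ms, and the major timeout is
 * declared at the first expiry at or after timeo_ms << retrans.
 * Returns the elapsed time in ms at which the major timeout fires.
 */
unsigned long major_timeout_ms(unsigned long first_ms,
                               unsigned long timeo_ms,
                               unsigned int retrans)
{
    unsigned long deadline, elapsed = 0, interval = first_ms;

    assert(first_ms > 0 && timeo_ms > 0 && retrans < 16);
    deadline = timeo_ms << retrans;
    for (;;) {
        elapsed += interval;        /* this minor timeout expires */
        if (elapsed >= deadline)    /* deadline has passed: major timeout */
            return elapsed;
        interval *= 2;              /* otherwise back off and retransmit */
    }
}
```

With the estimator reset to 700 ms, the model gives 10500 ms for retrans=3 and 44100 ms for retrans=5, which agrees with the "subsequent" major timeouts reported in the message. It is only a rough guide for the first timeout, where the real interval comes from the RTT estimator.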
* Re: NFS and kernel 2.6.x
2004-04-18 3:26 ` Jamie Lokier
@ 2004-04-18 7:03 ` Trond Myklebust
2004-04-18 23:22 ` Jamie Lokier
0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-18 7:03 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-kernel

On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:
> Are they intended to stop doubling at 3.2? The major timeout
> thus happens after 22.3 seconds.
>
> Unsurprisingly, subsequent major timeouts take 44.1 seconds.

Right...

...but since the timeout value is already capped at 60 seconds, this
is not a major problem. It is pretty pointless to be talking about
"predictable" or "consistent" behaviour when talking about a situation
where we believe that the server has crashed.

AFAICS, all we care about is to establish a predictable *lower limit*.

Cheers,
Trond
* Re: NFS and kernel 2.6.x
2004-04-18 7:03 ` Trond Myklebust
@ 2004-04-18 23:22 ` Jamie Lokier
2004-04-19 15:38 ` Trond Myklebust
0 siblings, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2004-04-18 23:22 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Jamie Lokier, linux-kernel

Trond Myklebust wrote:
> On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:
> > Are they intended to stop doubling at 3.2? The major timeout
> > thus happens after 22.3 seconds.
> >
> > Unsurprisingly, subsequent major timeouts take 44.1 seconds.
>
> Right... ...but since the timeout value is already capped at 60 seconds,
> this is not a major problem. It is pretty pointless to be talking about
> "predictable" or "consistent" behaviour when talking about a situation
> where we believe that the server has crashed.

I agree, but would still prefer more consistent behaviour if it is
easy -- and I explained how to do it, it's an easy algorithm.

You don't respond to the other question: the doubling stopping at
3.2s. Is it intended? It goes against a basic principle of congestion
control.

> AFAICS, all we care about is to establish a predictable *lower limit*.

I agree that is the most important thing, and the old behaviour was
probably the cause of problems for at least one poster on this thread.

-- Jamie
* Re: NFS and kernel 2.6.x
2004-04-18 23:22 ` Jamie Lokier
@ 2004-04-19 15:38 ` Trond Myklebust
2004-04-19 16:19 ` Trond Myklebust
2004-04-20 0:09 ` Jamie Lokier
0 siblings, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 15:38 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-kernel

On Sun, 2004-04-18 at 19:22, Jamie Lokier wrote:
> I agree, but would still prefer more consistent behaviour if it is
> easy -- and I explained how to do it, it's an easy algorithm.

The reason I don't like it is that it continues to tie the major
timeout to the resend timeouts. You've convinced me that they should
not be the same thing.

The other reason is that it only improves matters for the first
request. Once we reset the RTO, all those other outstanding requests
are anyway going to see an immediate discontinuity as their basic
timeout jumps from 1ms to 700ms. So why go to all that trouble just
for 1 request?

> You don't respond to the other question: the doubling stopping at
> 3.2s. Is it intended? It goes against a basic principle of congestion
> control.

I can put it back in.

It was partly another "consistency" issue that initially worried me,
partly in order to avoid problems with overflow:

If you have more than one outstanding request, then those that get
scheduled after the first major timeout (when we reset the RTO
estimator) will see a "jump". If the "retries" variable is too large,
they will either jump straight over 60 seconds, and thus trigger the
cap, or they will end up at zero due to 32-bit overflow.

I agree, though, that this is less of an issue.

Cheers,
Trond
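The overflow concern can be illustrated outside the kernel. The C sketch below uses a hypothetical helper, backoff_capped, that scales a base timeout by 2^shift but saturates at a maximum instead of wrapping. It is an assumption for illustration only: the patch itself shifts first and then tests "> to_maxval || == 0", whereas this version checks before shifting so that no bits are ever lost.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: scale a base timeout by
 * 2^shift, saturating at maxval instead of wrapping around.
 */
uint32_t backoff_capped(uint32_t base, unsigned int shift, uint32_t maxval)
{
    assert(maxval > 0);
    /*
     * Shifting a 32-bit value by 32 or more is undefined in C, and a
     * base above (maxval >> shift) would exceed the cap or overflow.
     */
    if (shift >= 32 || base > (maxval >> shift))
        return maxval;
    /* Like the patch's "== 0" test, a zero timeout falls back to the cap. */
    if (base == 0)
        return maxval;
    return base << shift;   /* cannot exceed maxval, so it cannot wrap */
}
```

For comparison, an unchecked 32-bit shift such as 0x80000000 << 1 yields 0, which is the "end up at zero" case mentioned above; with the helper, backoff_capped(700, 3, 60000) gives 5600 and any larger shift count pins the result at 60000.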
* Re: NFS and kernel 2.6.x
2004-04-19 15:38 ` Trond Myklebust
@ 2004-04-19 16:19 ` Trond Myklebust
2004-04-20 0:09 ` Jamie Lokier
1 sibling, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-19 16:19 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]

On Mon, 2004-04-19 at 11:38, Trond Myklebust wrote:
> On Sun, 2004-04-18 at 19:22, Jamie Lokier wrote:
> > You don't respond to the other question: the doubling stopping at
> > 3.2s. Is it intended? It goes against a basic principle of congestion
> > control.
>
> I can put it back in.

Here's a patch that continues doubling.

Cheers,
Trond

[-- Attachment #2: Type: text/plain, Size: 9195 bytes --]

 include/linux/sunrpc/xprt.h    |   10 ++---
 net/sunrpc/auth_gss/auth_gss.c |    2 -
 net/sunrpc/clnt.c              |    4 --
 net/sunrpc/timer.c             |    1
 net/sunrpc/xprt.c              |   81 +++++++++++++++++++++++++----------------
 5 files changed, 57 insertions(+), 41 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/include/linux/sunrpc/xprt.h linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h
--- linux-2.6.6-rc1/include/linux/sunrpc/xprt.h	2004-04-17 23:01:09.000000000 -0400
+++ linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h	2004-04-19 11:57:32.000000000 -0400
@@ -69,8 +69,7 @@ extern unsigned int xprt_tcp_slot_table_
  * This describes a timeout strategy
  */
 struct rpc_timeout {
-	unsigned long		to_current,		/* current timeout */
-				to_initval,		/* initial timeout */
+	unsigned long		to_initval,		/* initial timeout */
 				to_maxval,		/* max timeout */
 				to_increment;		/* if !exponential */
 	unsigned int		to_retries;		/* max # of retries */
@@ -85,7 +84,6 @@ struct rpc_rqst {
 	 * This is the user-visible part
 	 */
 	struct rpc_xprt *	rq_xprt;		/* RPC client */
-	struct rpc_timeout	rq_timeout;		/* timeout parms */
 	struct xdr_buf		rq_snd_buf;		/* send buffer */
 	struct xdr_buf		rq_rcv_buf;		/* recv buffer */

@@ -103,6 +101,9 @@ struct rpc_rqst {
 	struct xdr_buf		rq_private_buf;		/* The receive buffer
 							 * used in the softirq.
 							 */
+	unsigned long		rq_majortimeo;	/* major timeout alarm */
+	unsigned long		rq_timeout;	/* Current timeout value */
+	unsigned int		rq_retries;	/* # of retries */
 	/*
 	 * For authentication (e.g. auth_des)
 	 */
@@ -115,7 +116,6 @@ struct rpc_rqst {
 	u32			rq_bytes_sent;	/* Bytes we have sent */

 	unsigned long		rq_xtime;	/* when transmitted */
-	int			rq_ntimeo;
 	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
@@ -210,7 +210,7 @@ void			xprt_reserve(struct rpc_task *);
 int			xprt_prepare_transmit(struct rpc_task *);
 void			xprt_transmit(struct rpc_task *);
 void			xprt_receive(struct rpc_task *);
-int			xprt_adjust_timeout(struct rpc_timeout *);
+int			xprt_adjust_timeout(struct rpc_rqst *req);
 void			xprt_release(struct rpc_task *);
 void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c
--- linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c	2004-04-17 23:00:57.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c	2004-04-19 11:57:32.000000000 -0400
@@ -736,10 +736,8 @@ static int
 gss_refresh(struct rpc_task *task)
 {
 	struct rpc_clnt *clnt = task->tk_client;
-	struct rpc_xprt *xprt = task->tk_xprt;
 	struct rpc_cred *cred = task->tk_msg.rpc_cred;

-	task->tk_timeout = xprt->timeout.to_current;
 	if (!gss_cred_is_uptodate_ctx(cred))
 		return gss_upcall(clnt, task, cred);
 	return 0;
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/clnt.c linux-2.6.6-01-soft/net/sunrpc/clnt.c
--- linux-2.6.6-rc1/net/sunrpc/clnt.c	2004-04-17 23:00:47.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/clnt.c	2004-04-19 11:57:32.000000000 -0400
@@ -788,13 +788,11 @@ static void
 call_timeout(struct rpc_task *task)
 {
 	struct rpc_clnt	*clnt = task->tk_client;
-	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;

-	if (xprt_adjust_timeout(to)) {
+	if (xprt_adjust_timeout(task->tk_rqstp) == 0) {
 		dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
 		goto retry;
 	}
-	to->to_retries = clnt->cl_timeout.to_retries;

 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (RPC_IS_SOFT(task)) {
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/timer.c linux-2.6.6-01-soft/net/sunrpc/timer.c
--- linux-2.6.6-rc1/net/sunrpc/timer.c	2004-04-17 23:01:20.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/timer.c	2004-04-19 11:57:32.000000000 -0400
@@ -39,6 +39,7 @@ rpc_init_rtt(struct rpc_rtt *rt, unsigne
 	for (i = 0; i < 5; i++) {
 		rt->srtt[i] = init;
 		rt->sdrtt[i] = RPC_RTO_INIT;
+		rt->ntimeouts[i] = 0;
 	}
 }

diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/xprt.c linux-2.6.6-01-soft/net/sunrpc/xprt.c
--- linux-2.6.6-rc1/net/sunrpc/xprt.c	2004-04-17 23:01:07.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/xprt.c	2004-04-19 11:58:03.000000000 -0400
@@ -352,35 +352,57 @@ xprt_adjust_cwnd(struct rpc_xprt *xprt, 
 }

 /*
+ * Reset the major timeout value
+ */
+static void xprt_reset_majortimeo(struct rpc_rqst *req)
+{
+	struct rpc_timeout *to = &req->rq_xprt->timeout;
+
+	req->rq_majortimeo = req->rq_timeout;
+	if (to->to_exponential)
+		req->rq_majortimeo <<= to->to_retries;
+	else
+		req->rq_majortimeo += to->to_increment * to->to_retries;
+	if (req->rq_majortimeo > to->to_maxval || req->rq_majortimeo == 0)
+		req->rq_majortimeo = to->to_maxval;
+	req->rq_majortimeo += jiffies;
+}
+
+/*
  * Adjust timeout values etc for next retransmit
  */
-int
-xprt_adjust_timeout(struct rpc_timeout *to)
+int xprt_adjust_timeout(struct rpc_rqst *req)
 {
-	if (to->to_retries > 0) {
+	struct rpc_xprt *xprt = req->rq_xprt;
+	struct rpc_timeout *to = &xprt->timeout;
+	int status = 0;
+
+	if (time_before(jiffies, req->rq_majortimeo)) {
 		if (to->to_exponential)
-			to->to_current <<= 1;
+			req->rq_timeout <<= 1;
 		else
-			to->to_current += to->to_increment;
-		if (to->to_maxval && to->to_current >= to->to_maxval)
-			to->to_current = to->to_maxval;
+			req->rq_timeout += to->to_increment;
+		if (to->to_maxval && req->rq_timeout >= to->to_maxval)
+			req->rq_timeout = to->to_maxval;
+		req->rq_retries++;
+		pprintk("RPC: %lu retrans\n", jiffies);
 	} else {
-		if (to->to_exponential)
-			to->to_initval <<= 1;
-		else
-			to->to_initval += to->to_increment;
-		if (to->to_maxval && to->to_initval >= to->to_maxval)
-			to->to_initval = to->to_maxval;
-		to->to_current = to->to_initval;
+		req->rq_timeout = to->to_initval;
+		req->rq_retries = 0;
+		xprt_reset_majortimeo(req);
+		/* Reset the RTT counters == "slow start" */
+		spin_lock_bh(&xprt->sock_lock);
+		rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval);
+		spin_unlock_bh(&xprt->sock_lock);
+		pprintk("RPC: %lu timeout\n", jiffies);
+		status = -ETIMEDOUT;
 	}

-	if (!to->to_current) {
-		printk(KERN_WARNING "xprt_adjust_timeout: to_current = 0!\n");
-		to->to_current = 5 * HZ;
-	}
-	pprintk("RPC: %lu %s\n", jiffies,
-			to->to_retries? "retrans" : "timeout");
-	return to->to_retries-- > 0;
+	if (req->rq_timeout == 0) {
+		printk(KERN_WARNING "xprt_adjust_timeout: rq_timeout = 0!\n");
+		req->rq_timeout = 5 * HZ;
+	}
+	return status;
 }

 /*
@@ -1166,6 +1188,7 @@ xprt_transmit(struct rpc_task *task)
 			/* Add request to the receive list */
 			list_add_tail(&req->rq_list, &xprt->recv);
 			spin_unlock_bh(&xprt->sock_lock);
+			xprt_reset_majortimeo(req);
 		}
 	} else if (!req->rq_bytes_sent)
 		return;
@@ -1221,7 +1244,7 @@ xprt_transmit(struct rpc_task *task)
 		if (!xprt_connected(xprt))
 			task->tk_status = -ENOTCONN;
 		else if (test_bit(SOCK_NOSPACE, &xprt->sock->flags)) {
-			task->tk_timeout = req->rq_timeout.to_current;
+			task->tk_timeout = req->rq_timeout;
 			rpc_sleep_on(&xprt->pending, task, NULL, NULL);
 		}
 		spin_unlock_bh(&xprt->sock_lock);
@@ -1248,13 +1271,11 @@ xprt_transmit(struct rpc_task *task)
 	if (!xprt->nocong) {
 		int timer = task->tk_msg.rpc_proc->p_timer;
 		task->tk_timeout = rpc_calc_rto(clnt->cl_rtt, timer);
-		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer);
-		task->tk_timeout <<= clnt->cl_timeout.to_retries
-			- req->rq_timeout.to_retries;
-		if (task->tk_timeout > req->rq_timeout.to_maxval)
-			task->tk_timeout = req->rq_timeout.to_maxval;
+		task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer) + req->rq_retries;
+		if (task->tk_timeout > xprt->timeout.to_maxval || task->tk_timeout == 0)
+			task->tk_timeout = xprt->timeout.to_maxval;
 	} else
-		task->tk_timeout = req->rq_timeout.to_current;
+		task->tk_timeout = req->rq_timeout;
 	/* Don't race with disconnect */
 	if (!xprt_connected(xprt))
 		task->tk_status = -ENOTCONN;
@@ -1324,7 +1345,7 @@ xprt_request_init(struct rpc_task *task,
 {
 	struct rpc_rqst	*req = task->tk_rqstp;

-	req->rq_timeout = xprt->timeout;
+	req->rq_timeout = xprt->timeout.to_initval;
 	req->rq_task	= task;
 	req->rq_xprt    = xprt;
 	req->rq_xid     = xprt_alloc_xid(xprt);
@@ -1381,7 +1402,6 @@ xprt_default_timeout(struct rpc_timeout 
 void
 xprt_set_timeout(struct rpc_timeout *to, unsigned int retr, unsigned long incr)
 {
-	to->to_current   =
 	to->to_initval   =
 	to->to_increment = incr;
 	to->to_maxval    = incr * retr;
@@ -1446,7 +1466,6 @@ xprt_setup(int proto, struct sockaddr_in
 	/* Set timeout parameters */
 	if (to) {
 		xprt->timeout = *to;
-		xprt->timeout.to_current = to->to_initval;
 	} else
 		xprt_default_timeout(&xprt->timeout, xprt->prot);

* Re: NFS and kernel 2.6.x
2004-04-19 15:38 ` Trond Myklebust
2004-04-19 16:19 ` Trond Myklebust
@ 2004-04-20 0:09 ` Jamie Lokier
1 sibling, 0 replies; 54+ messages in thread
From: Jamie Lokier @ 2004-04-20 0:09 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-kernel

Trond Myklebust wrote:
> > I agree, but would still prefer more consistent behaviour if it is
> > easy -- and I explained how to do it, it's an easy algorithm.
>
> The reason I don't like it is that it continues to tie the major timeout
> to the resend timeouts. You've convinced me that they should not be the
> same thing.

Sorry, I don't understand that paragraph. The algorithm I suggested
_decouples_ the major timeout from the rtt estimate. Your algorithm
strongly couples them. I'm not sure what you mean by saying the major
timeout is "tied to the resend timeouts".

Your current (patched) algorithm sets the major timeout to be in the
range:

    [timeo << retrans, (timeo << retrans) * 2]

The suggested algorithm sets the major timeout to be in the range:

    [timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo)

I.e. with retrans set to a new default of 5 (I think that's useful),
the major timeout is approx [44.8, 46] instead of [22.4, 44.8].

I agree it's not the most important thing in the world, but it is nice
to be able to fix the parameters and say that with the defaults, major
timeout happens after about 45 seconds.

You say you don't like it because major timeout is still tied to
something. Could you explain what the ideal behaviour you have in mind
is? Right now, with the patch, I think your intention is to have a
fixed major timeout time, but it doesn't work like that.

> The other reason is that it only improves matters for the first request.
> Once we reset the RTO, all those other outstanding requests are anyway
> going to see an immediate discontinuity as their basic timeout jumps
> from 1ms to 700ms.

Yes, that's the point: after retransmits pass a threshold, we should
no longer depend on the RTO estimate because it doesn't seem to be
reliable.

> So why go to all that trouble just for 1 request?

Because it's visible behaviour with "soft" mounts. Someone unplugs the
cable or the network is down, and you see the I/O errors after about
40 seconds. This is nicer than seeing them after an unknown period
between 40 and 80 (or 20 and 40 depending on your settings).

> It was partly another "consistency" issue that initially worried me,
> partly in order to avoid problems with overflow:
> If you have more than one outstanding request, then those that get
> scheduled after the first major timeout (when we reset the RTO
> estimator) will see a "jump". If the "retries" variable is too large,
> they will either jump straight over 60 seconds, and thus trigger the cap
> or they will end up at zero due to 32-bit overflow.

Ah. So you keep track of the number of retries per request, and each
time you send a request you set its timeout to (RTO << retries)?

If you do, maybe that's why my algorithm seems overcomplicated, and
you're concerned about overflows etc.

Instead of counting retries, don't. You don't need a per-request
retries counter.

Instead: keep track of the request_timeout when the request was last
issued. When retransmitting, compare that value against the global
value (timeo << retrans). When a request times out and
request_timeout >= (timeo << retrans), that's a major timeout.

Otherwise you just check if request_timeout < timeo. If yes, double
it. If no, set request_timeout = timeo << N for the smallest integer N
such that it's an increase. And try again.

Notice how that logic is based on constants: it's independent of RTO,
and so outstanding requests aren't affected by changes in RTO. There's
no jump, no overflow, and you can compute the key constant
(timeo << retrans) when initialising: retrans isn't used by itself.

-- Jamie
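The counter-free scheme described in this message can be sketched in a few lines of C. This is an illustrative reading of the text, not code from either patch: the names on_timeout and total_until_major are hypothetical, times are plain integers (milliseconds in the examples), and the rule is applied literally, so a timeout below timeo is doubled even when that overshoots timeo.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the retry-counter-free scheme (names are hypothetical).
 * *request_timeout is the interval that has just expired.  Returns
 * true for a major timeout; otherwise stores the interval to use for
 * the next retransmission and returns false.
 */
bool on_timeout(unsigned long *request_timeout,
                unsigned long timeo, unsigned int retrans)
{
    unsigned long major = timeo << retrans;  /* could be precomputed once */
    unsigned long next;

    assert(*request_timeout > 0 && timeo > 0 && retrans < 16);
    if (*request_timeout >= major)
        return true;                         /* major timeout */
    if (*request_timeout < timeo) {
        *request_timeout *= 2;               /* below timeo: just double */
        return false;
    }
    /* Smallest timeo << N that is an increase over the current value. */
    for (next = timeo; next <= *request_timeout; next <<= 1)
        ;
    *request_timeout = next;
    return false;
}

/* Total time until the major timeout, starting from interval `first`. */
unsigned long total_until_major(unsigned long first,
                                unsigned long timeo, unsigned int retrans)
{
    unsigned long cur = first, total = first;

    while (!on_timeout(&cur, timeo, retrans))
        total += cur;                        /* wait out the next interval */
    return total;
}
```

With timeo=700 and retrans=5, a request starting at 700 reaches the major timeout after 44100, matching the 44.1 seconds quoted earlier in the thread. Starting at 100, the literal rule steps 100, 200, 400, 800, 1400, ... and totals 44900, slightly different from the 0.4 to 0.7 step in the earlier example, so the exact snapping rule is a detail this sketch does not settle.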
* Re: NFS and kernel 2.6.x
[not found] ` <20040417000353.GA3750@widomaker.com>
@ 2004-04-17 5:28 ` Trond Myklebust
2004-04-17 17:55 ` Charles Shannon Hendrix
0 siblings, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 5:28 UTC (permalink / raw)
To: Charles Shannon Hendrix; +Cc: linux-kernel

On Fri, 2004-04-16 at 17:03, Charles Shannon Hendrix wrote:
> >
> > 2.6.x can cache a lot more data, and will tend to write it out in a more
> > lazy fashion (i.e. only when the user requests it). That means the
> > writes tend to occur in a more bursty fashion.
>
> That makes sense.
>
> Was there a specific reason for making NFS traffic bursty, or did it
> just work out that way?

It's an inevitable side-effect of the increased caching. If you are
constantly writing out data, then you spread out the load a lot more
than if you wait until the user actually requests a flush.

On the other hand, it means that if your application reads/writes
several times over the same page, then you only write it out once.

Cheers,
Trond
* Re: NFS and kernel 2.6.x
2004-04-17 5:28 ` Trond Myklebust
@ 2004-04-17 17:55 ` Charles Shannon Hendrix
2004-04-17 18:55 ` Trond Myklebust
0 siblings, 1 reply; 54+ messages in thread
From: Charles Shannon Hendrix @ 2004-04-17 17:55 UTC (permalink / raw)
To: Linux Kernel

Fri, 16 Apr 2004 @ 22:28 -0700, Trond Myklebust said:

> It's an inevitable side-effect of the increased caching.

OK. That answers my question: was making NFS bursty done on purpose?
Answer: no.

> If you are constantly writing out data, then you spread out the load
> a lot more than if you wait until the user actually requests a flush.
> On the other hand, it means that if your application reads/writes
> several times over the same page, then you only write it out once.

Usually, eliminating redundant writes in your application is a better
optimization than relying on the OS to do it for you.

I find bursty I/O is less desirable in most cases.

--
shannon "AT" widomaker.com -- ["The trade of governing has always been
monopolized by the most ignorant and the most rascally individuals of
mankind. -- Thomas Paine"]
* Re: NFS and kernel 2.6.x
2004-04-17 17:55 ` Charles Shannon Hendrix
@ 2004-04-17 18:55 ` Trond Myklebust
0 siblings, 0 replies; 54+ messages in thread
From: Trond Myklebust @ 2004-04-17 18:55 UTC (permalink / raw)
To: Charles Shannon Hendrix; +Cc: Linux Kernel

On Sat, 2004-04-17 at 10:55, Charles Shannon Hendrix wrote:
> Usually, eliminating redundant writes in your application is a better
> optimization than relying on the OS to do it for you.

Fine. As long as you can convince all the other people sharing the
same page cache to do so too. We're not talking about single
applications here...

Cheers,
Trond
* Re: NFS and kernel 2.6.x
[not found] ` <1LDBL-uY-3@gated-at.bofh.it>
@ 2004-04-16 20:31 ` Andi Kleen
0 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2004-04-16 20:31 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> writes:
>
> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

Problem is that older linux knfsd (early 2.4) tend to crash or hang
after some time when they have to talk TCP. But I guess it would still
be a better default ...

-Andi