* FLushing cached writes in nfs_getattr() and stat() delay
@ 2008-11-06 15:34 Alex Sidorenko
2008-11-06 16:40 ` Chuck Lever
2008-11-06 18:49 ` Trond Myklebust
From: Alex Sidorenko @ 2008-11-06 15:34 UTC
To: linux-nfs
Hello,
I am an HP engineer participating in L3 Linux support. Recently we have found
that the current design of nfs_getattr() can cause huge delays in stat() on
a file that is being written to (this matters for big files only, >2 GB).
The problem
-----------
Assuming that /nfs is an NFS-mounted FS:
1. In one shell, start
$ dd if=/dev/zero of=/nfs/dir/big bs=1G count=20
2. In another shell, start
$ ls -l /nfs/dir
or
$ ls -l /nfs/dir/big
'ls' does not return until the whole /nfs/dir/big is written.
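
The delay in step 2 can be put into numbers by timing a single stat() call while the dd from step 1 is still running. The following is only a minimal sketch, assuming Python 3 on the NFS client; the helper name time_stat is hypothetical and not part of the original report.

```python
import os
import time


def time_stat(path):
    """Return (elapsed_seconds, stat_result) for one os.stat() call on path."""
    start = time.monotonic()
    # On an NFS mount this call is served by nfs_getattr() in the kernel,
    # so it blocks for as long as the flush of cached writes takes.
    result = os.stat(path)
    elapsed = time.monotonic() - start
    return elapsed, result
```

Calling time_stat("/nfs/dir/big") in the second shell and printing the elapsed time, st_size and st_mtime should show whether the call only returns once the writer has finished.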
Analysis
--------
Kernel 2.6.16 has introduced the following change in nfs_getattr():
NFS: Make stat() return updated mtimes after a write()
The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
+ /* Flush out writes to the server in order to update c/mtime */
+ nfs_sync_inode(inode, 0, 0, FLUSH_WAIT|FLUSH_NOCOMMIT);
Then later:
2.6.16-rc6:
http://www.linux-nfs.org/Linux-2.6.x/2.6.16-rc6/linux-2.6.16-99-fix_nfs_sync_inode_race.dif
/* Flush out writes to the server in order to update c/mtime */
- nfs_sync_inode(inode, 0, 0, FLUSH_WAIT|FLUSH_NOCOMMIT);
+ nfs_sync_inode_wait(inode, 0, 0, FLUSH_NOCOMMIT);
I understand the reasoning behind that. From application point of view, NFS
file/directory should behave the same as on local FS. If we have queued many
writes, without this patch stat() will return incorrect results, both for
mtime and file length. Some applications may depend on stat() results being
correct.
At the same time, the fact that we have to wait forever while copying big
files and doing 'ls -l' on that directory (or on the file being written) is
not very good either (two HP customers have complained about this after
migrating from RHEL4 to RHEL5).
The problem is still there in 2.6.27. I am not sure what can be done to both
reduce the stat() delay and guarantee reasonable stat() results.
It is interesting that with 'noac' stat() returns much faster (just 1-3s
delay).
Best regards,
Alex
--
------------------------------------------------------------------
Alexandre Sidorenko email: asid@hp.com
Global Solutions Engineering: Unix Networking
Hewlett-Packard (Canada)
------------------------------------------------------------------
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Chuck Lever @ 2008-11-06 16:40 UTC
To: Alex Sidorenko; +Cc: linux-nfs

Hi Alex-

On Nov 6, 2008, at 10:34 AM, Alex Sidorenko wrote:
> The problem is still there in 2.6.27. I am not sure what can be done
> to both reduce the stat() delay and guarantee reasonable stat()
> results.

The goal is to meet the POSIX requirement that the mtime of the
returned stat(2) results must reflect the mtime of the latest
application write(2) request. If the client is caching writes, then
those must be flushed to the server first because only the server
determines the file's mtime.

If the client limited its write cache to a few dozen megabytes, the
delay during stat(2) would be nearly unnoticeable.

> It is interesting that with 'noac' stat() returns much faster (just
> 1-3s delay).

That's because "noac" never caches writes on the client.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
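
The point about limiting the write cache can be illustrated with simple arithmetic: the time stat(2) spends waiting is roughly the amount of dirty data divided by the write throughput to the server. This is a rough sketch only; the function name and the 100 MiB/s throughput are assumptions chosen for illustration, not measurements from this thread.

```python
def flush_delay_seconds(dirty_bytes, throughput_bytes_per_s):
    """Estimate how long a flush of dirty_bytes takes at a given write rate."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dirty_bytes / throughput_bytes_per_s


MiB = 1024 * 1024
GiB = 1024 * MiB

# Hypothetical sustained NFS write rate; substitute a measured value.
ASSUMED_THROUGHPUT = 100 * MiB

# "A few dozen megabytes" of cached writes.
small_cache = flush_delay_seconds(32 * MiB, ASSUMED_THROUGHPUT)

# The 20 GiB written by the dd in the original report, as an upper bound.
big_backlog = flush_delay_seconds(20 * GiB, ASSUMED_THROUGHPUT)
```

Under these assumed numbers the small cache flushes in about a third of a second, while the full 20 GiB takes more than three minutes, which is the order of magnitude of the delay being reported.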
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Trond Myklebust @ 2008-11-06 18:49 UTC
To: Alex Sidorenko; +Cc: linux-nfs

On Thu, 2008-11-06 at 10:34 -0500, Alex Sidorenko wrote:
> I understand the reasoning behind that. From application point of view, NFS
> file/directory should behave the same as on local FS. If we have queued many
> writes, without this patch stat() will return incorrect results, both for
> mtime and file length. Some applications may depend on stat() results being
> correct.
>
> At the same time, the fact that we have to wait forever while copying big
> files and doing 'ls -l' on that directory (or on the file being written) is
> not very good either (two HP customers have complained about this after
> migrating from RHEL4 to RHEL5).

In order to relax that requirement, we'd have to introduce some
mechanism for the application to notify the filesystem that they don't
care about strictly correct c/mtimes. As you noted above, returning
incorrect mtimes may trip up some applications (backup applications, and
mail readers are a couple of business critical cases that come to mind).

> The problem is still there in 2.6.27. I am not sure what can be done to both
> reduce the stat() delay and guarantee reasonable stat() results.
>
> It is interesting that with 'noac' stat() returns much faster (just 1-3s
> delay).

That would be because 'noac' enforces synchronous writes. If you don't
care about the degraded write performance, you can do the same thing
without all the extra getattr clutter that noac introduces, by simply
mounting with -osync.

Cheers
  Trond
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Chuck Lever @ 2008-11-06 19:15 UTC
To: Trond Myklebust; +Cc: Alex Sidorenko, linux-nfs

On Nov 6, 2008, at 1:49 PM, Trond Myklebust wrote:
> In order to relax that requirement, we'd have to introduce some
> mechanism for the application to notify the filesystem that they don't
> care about strictly correct c/mtimes. As you noted above, returning
> incorrect mtimes may trip up some applications (backup applications,
> and mail readers are a couple of business critical cases that come to
> mind).

I thought the preferred way to address this issue was to limit the
per-file write cache size on the client. That is effectively what you
do with "-osync" as you suggest below.

Especially on big systems, the client will delay writes until the cows
come home. Then someone does a stat(2) and the lights dim... Really
the client shouldn't need to cache that aggressively, and there are
good reasons to keep a cap on it.

> That would be because 'noac' enforces synchronous writes. If you don't
> care about the degraded write performance, you can do the same thing
> without all the extra getattr clutter that noac introduces, by simply
> mounting with -osync.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Alex Sidorenko @ 2008-11-06 19:22 UTC
To: Trond Myklebust; +Cc: linux-nfs@vger.kernel.org

On November 6, 2008 01:49:56 pm Trond Myklebust wrote:
> That would be because 'noac' enforces synchronous writes. If you don't
> care about the degraded write performance, you can do the same thing
> without all the extra getattr clutter that noac introduces, by simply
> mounting with -osync.

Hi Trond,

In my experiments on 2.6.24 I saw practically no performance degradation while
doing 'cp' of a 4Gb file with 'noac', with 'sync' the performance is really
bad. And writes are still definitely ASYNC, here is what I see using
Systemtap script on entry to rpc_execute

from /etc/mtab:

cats:/data /mnt nfs rw,udp,noac,hard,intr,addr=192.168.0.33 0 0

$ dd if=/dev/zero of=/mnt/win/big bs=100m count=1

From stap output:
rpc_execute p_proc=7 WRITE qlen=0 prio=1 flags=0x1
--ts=4
rpc_execute p_proc=7 WRITE qlen=0 prio=1 flags=0x1
...

So we still have RPC_TASK_ASYNC set.

I did not check experimentally 'noac' on 2.6.27 but I still think that 'noac'
does not make writes sync. nfs_commit_rpcsetup() still sets RPC_TASK_ASYNC by
default and I don't see NFS_MOUNT_NOACL setting FLUSH_SYNC anywhere.

So I still don't quite understand why 'noac' eliminates the delay. Chuck Lever
says that "noac" never caches writes on the client. Printing
xprt->backlog->qlen in my experiments I can still see a significant backlog
even with 'noac', e.g.

--ts=32
rpc_execute p_proc=7 WRITE qlen=3086 prio=1 flags=0x1

but 'stat' delay is just 1-2s.

Regards,
Alex
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Trond Myklebust @ 2008-11-06 19:32 UTC
To: Alex Sidorenko; +Cc: linux-nfs@vger.kernel.org

On Thu, 2008-11-06 at 14:22 -0500, Alex Sidorenko wrote:
> I did not check experimentally 'noac' on 2.6.27 but I still think that 'noac'
> does not make writes sync. nfs_commit_rpcsetup() still sets RPC_TASK_ASYNC by
> default and I don't see NFS_MOUNT_NOACL setting FLUSH_SYNC anywhere.

I repeat: 'noac' automatically sets the 'sync' flag, as you can see
below:

# mount -t nfs -onoac fas960-1:/vol/san /mnt
# cat /proc/mounts | grep nfs
fas960-1:/vol/san /mnt nfs rw,sync,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,nointr,noac,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=141.211.133.16,mountvers=3,mountproto=tcp,addr=141.211.133.16 0 0

Trond
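
The check shown above (reading the effective options from /proc/mounts rather than /etc/mtab) can also be done programmatically. The following is a small sketch, assuming the usual whitespace-separated /proc/mounts layout of device, mount point, filesystem type and options; the function names are hypothetical.

```python
def mount_options(mounts_text, mount_point):
    """Return the option set for mount_point from /proc/mounts-style text.

    If the mount point appears more than once, the last entry wins.
    """
    options = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
    return options


def writes_are_sync(mounts_text, mount_point):
    """True if the 'sync' option is reported for mount_point."""
    return "sync" in mount_options(mounts_text, mount_point)
```

Fed with the contents of /proc/mounts, writes_are_sync(text, "/mnt") would be true for the 'noac' mount shown above, whereas the /etc/mtab line quoted earlier in the thread lists only the options that were passed to mount and so does not show 'sync'.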
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Chuck Lever @ 2008-11-06 19:45 UTC
To: Alex Sidorenko; +Cc: Trond Myklebust, linux-nfs@vger.kernel.org

On Nov 6, 2008, at 2:22 PM, Alex Sidorenko wrote:
> Hi Trond,
>
> In my experiments on 2.6.24 I saw practically no performance
> degradation while doing 'cp' of a 4Gb file with 'noac', with 'sync'
> the performance is really bad. And writes are still definitely ASYNC,
> here is what I see using Systemtap script on entry to rpc_execute

There's a difference between an asynchronous RPC request, and an
asynchronous write request.

An async RPC means the process doesn't wait for the request to finish,
it can perform other housekeeping.

An async write means that the client delays sending NFS writes,
maintaining the dirty data in its memory. It can send the NFS write
requests by means of an async RPC if it wishes. A synchronous write
means that the client will block the application until the server has
replied that the dirty data is on the server's disk.

> from /etc/mtab:
>
> cats:/data /mnt nfs rw,udp,noac,hard,intr,addr=192.168.0.33 0 0
>
> $ dd if=/dev/zero of=/mnt/win/big bs=100m count=1
>
> From stap output:
> rpc_execute p_proc=7 WRITE qlen=0 prio=1 flags=0x1
> --ts=4
> rpc_execute p_proc=7 WRITE qlen=0 prio=1 flags=0x1
> ...
>
> So we still have RPC_TASK_ASYNC set.

See above.

> I did not check experimentally 'noac' on 2.6.27 but I still think
> that 'noac' does not make writes sync. nfs_commit_rpcsetup() still
> sets RPC_TASK_ASYNC by default and I don't see NFS_MOUNT_NOACL
> setting FLUSH_SYNC anywhere.

Again, RPC_TASK_ASYNC has nothing to do with whether the application
is blocked until the server says the write is permanent.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
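
To make the distinction above concrete, the trace lines quoted in this thread can be decoded so that the RPC-level async bit is reported separately from any statement about write semantics. This is a sketch only: the line format is simply what the custom Systemtap script in this thread happened to print, the 0x1 value for RPC_TASK_ASYNC follows the thread's own reading of flags=0x1, and the function name is hypothetical.

```python
import re

# Bit the thread identifies as RPC_TASK_ASYNC in "flags=0x1".
RPC_TASK_ASYNC = 0x0001

# Matches lines such as:
#   rpc_execute p_proc=7 WRITE qlen=3086 prio=1 flags=0x1
_RPC_LINE = re.compile(
    r"rpc_execute p_proc=(\d+) (\w+) qlen=(\d+) prio=(\d+) flags=(0x[0-9a-fA-F]+)"
)


def parse_rpc_execute(line):
    """Parse one trace line into a dict, or return None if it does not match."""
    match = _RPC_LINE.search(line)
    if match is None:
        return None
    flags = int(match.group(5), 16)
    return {
        "p_proc": int(match.group(1)),
        "op": match.group(2),
        "qlen": int(match.group(3)),
        "prio": int(match.group(4)),
        "flags": flags,
        # Only describes how the RPC task is scheduled; it says nothing
        # about whether write(2) blocks the application.
        "rpc_async": bool(flags & RPC_TASK_ASYNC),
    }
```

A record with rpc_async set therefore only shows that the RPC layer dispatched the WRITE without a waiting process; whether the mount uses synchronous writes still has to be read from the mount options.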
* Re: FLushing cached writes in nfs_getattr() and stat() delay
From: Alex Sidorenko @ 2008-11-06 19:55 UTC
To: Chuck Lever; +Cc: Trond Myklebust, linux-nfs@vger.kernel.org

On November 6, 2008 02:45:34 pm Chuck Lever wrote:
> There's a difference between an asynchronous RPC request, and an
> asynchronous write request.
>
> An async RPC means the process doesn't wait for the request to finish,
> it can perform other housekeeping.
>
> An async write means that the client delays sending NFS writes,
> maintaining the dirty data in its memory. It can send the NFS write
> requests by means of an async RPC if it wishes. A synchronous write
> means that the client will block the application until the server has
> replied that the dirty data is on the server's disk.

Thank you for clarification. I knew what RPC-async means but for some reason
forgot what SYNC-write means, sorry for confusion. And yes, I should have
looked at /proc/mounts instead of /etc/mtab

Regards,
Alex