* NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 4:34 Gary Grider
2006-11-28 5:54 ` Christoph Hellwig
2006-11-28 15:08 ` NFSv4/pNFS possible POSIX I/O API standards Matthew Wilcox
0 siblings, 2 replies; 124+ messages in thread
From: Gary Grider @ 2006-11-28 4:34 UTC (permalink / raw)
To: linux-fsdevel
>
>NFS developers: a group of people from the High End Computing
>Interagency Working Group File Systems and I/O (HECIWG FSIO), which is
>a funding oversight group for government-funded file systems and
>storage research, has formed a project to extend the POSIX I/O API.
>The extensions mostly address the needs of distributed computing,
>cluster computing, and high-end computing.
>
>Things like
>openg() - one process opens a file and gets a key that is passed to
>lots of processes, which use the key to get a handle (great for
>thousands of processes opening the same file)
>readx/writex - scatter-gather read/write - more appropriate and
>complete than the real-time extended read/write
>statlite() - asking for stat info without requiring completely
>accurate info like dates and sizes. This is good
>for running stat against a file that is open by hundreds of
>processes, which currently forces callbacks
>and the hundreds of processes to flush.
>
>Some of these things might be useful for pNFS and NFSv4.
>
>etc.
>
>In talking to Andy Adamson, we realized that NFS ACLs are a good
>candidate for POSIX standardization.
>There may be other items that might be good for standardization as well.
>
>The HECIWG FSIO POSIX team has already gotten the wheels rolling and
>has agreement to do the standardization effort within the Open Group.
>We have an official project in the platform group of the Open Group
>and will publish man pages for review and possible standardization
>as an extension to POSIX.
>The website is at
>http://www.opengroup.org/platform/hecewg/
>
>There are man pages for most of the proposed extensions at that
>site. We will be putting more
>out there in a few weeks.
>
>We in the HECIWG would welcome putting forth any NFS-related
>calls/utilities in our POSIX extension effort.
>We would need someone in the NFS community to work with, to give us
>the man pages and point us to example implementations.
>We would be happy to do the legwork of getting the documents done
>and the like.
>
>I also would be happy to add any of you to the POSIX HEC Extensions
>mailing list etc.
>if you want to monitor this effort.
>
>Andy and I felt that since we have the official POSIX Extensions
>project spun up, getting a few things from the NFSv4/pNFS community
>into the POSIX Standard would be a nice way to get leverage.
>
>I am very interested in your thoughts.
>Thanks
>Gary Grider
>Los Alamos National Lab
>HECIWG
^ permalink raw reply [flat|nested] 124+ messages in thread

* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 5:54 ` Christoph Hellwig
From: Christoph Hellwig @ 2006-11-28 5:54 UTC (permalink / raw)
To: Gary Grider; +Cc: linux-fsdevel

What crack have you guys been smoking?

^ permalink raw reply [flat|nested] 124+ messages in thread
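
The calls in the original message are described only in prose; a rough sketch
of what the prototypes might look like, reconstructed from those descriptions
and from the openg/sutoc naming used later in this thread (all names, types,
and argument lists here are assumptions, not the published HECEWG man pages):

/*
 * Hypothetical prototypes inferred from the descriptions in this thread;
 * the authoritative versions are the man pages on the Open Group HECEWG site.
 */
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t fh_key_t;          /* assumed: opaque, cluster-wide open key */

int openg(const char *path, int flags, fh_key_t *key); /* one opener exports a key   */
int sutoc(fh_key_t key);                               /* other processes map key->fd */

struct iox {                        /* assumed: iovec plus an explicit file offset */
    void   *iox_base;
    size_t  iox_len;
    off_t   iox_offset;
};
ssize_t readx(int fd, const struct iox *vec, int cnt);
ssize_t writex(int fd, const struct iox *vec, int cnt);

struct statlite;                    /* stat-like struct carrying an st_litemask field */
int statlite(const char *path, struct statlite *buf);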
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 10:54 ` Andreas Dilger
From: Andreas Dilger @ 2006-11-28 10:54 UTC (permalink / raw)
To: Gary Grider; +Cc: linux-fsdevel, Christoph Hellwig

On Nov 28, 2006  05:54 +0000, Christoph Hellwig wrote:
> What crack have you guys been smoking?

As usual, Christoph is a model of diplomacy :-).

> On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote:
> > >readx/writex - scatter-gather read/write - more appropriate and
> > >complete than the real-time extended read/write

IMHO, this is a logical extension to readv/writev.  It allows a single
readx/writex syscall to specify different targets in the file instead of
needing separate syscalls.  So, for example, a single syscall could be
given to dump a sparse corefile or a compiled+linked binary, allowing
the filesystem to optimize the allocations instead of getting essentially
random IO from several separate syscalls.

> > >statlite() - asking for stat info without requiring completely
> > >accurate info like dates and sizes. This is good
> > >for running stat against a file that is open by hundreds of
> > >processes, which currently forces callbacks
> > >and the hundreds of processes to flush.

This is a big win for clustered filesystems.  Some "stat" items are
a lot more work to gather than others, and if an application (e.g.
"ls --color", which is the default on all distros) doesn't need anything
except the file mode to print "*" and color an executable green, it
is a waste to gather the remaining ones.

My objection to the current proposal is that it should be possible
to _completely_ specify which fields are required and which are not,
instead of having a split "required" and "optional" section to the
stat data.  In some implementations, it might be desirable to only
find the file blocks (e.g. ls -s, du -s) and not the owner, links,
metadata, so why implement a half-baked version of a "lite" stat()?

Also, why pass the "st_litemask" as a parameter set in the struct
(which would have to be reset on each call) instead of as a parameter
to the function (which makes the calling convention much clearer)?

int statlite(const char *fname, struct stat *buf, unsigned long *statflags);

[ readdirplus not referenced ]
It would be prudent, IMHO, that if we are proposing statlite() and
readdirplus() syscalls, the readdirplus() syscall be implemented as a
special case of statlite().  It avoids the need for yet another version
in the future, "readdirpluslite()" or whatever...

Namely, readdirplus() takes a "statflags" parameter (per above) so that
the dirent_plus data only has to retrieve the required stat data (e.g.
ls -iR only needs the inode number) and not all of it.  Each returned
stat has a mask of valid fields in it, as e.g. some items might be in
cache already and can contain more information than others.

> > >The website is at
> > >http://www.opengroup.org/platform/hecewg/
> > >We in the HECIWG would welcome putting forth any NFS-related ...

Strange, the group is called HECIWG, but the website is "hecewg"?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply [flat|nested] 124+ messages in thread
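
A minimal sketch of the calling convention Andreas argues for, with made-up
SL_* mask bits: the caller passes in the fields it needs, and on return the
mask says which fields the filesystem actually filled in.

#include <sys/stat.h>

/* Illustrative mask bits -- not from any standard or proposal text. */
#define SL_MODE   0x0001UL
#define SL_SIZE   0x0002UL
#define SL_BLOCKS 0x0004UL
#define SL_MTIME  0x0008UL

/* The prototype Andreas suggests: statflags is an in/out argument. */
int statlite(const char *fname, struct stat *buf, unsigned long *statflags);

/* "ls --color"-style caller: only the mode bits are requested. */
static int is_executable(const char *path)
{
    struct stat st;
    unsigned long flags = SL_MODE;

    if (statlite(path, &st, &flags) != 0)
        return -1;
    if (!(flags & SL_MODE))         /* filesystem could not supply it cheaply */
        return -1;
    return (st.st_mode & 0111) != 0;
}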
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 11:28 ` Anton Altaparmakov
From: Anton Altaparmakov @ 2006-11-28 11:28 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Gary Grider, linux-fsdevel, Christoph Hellwig

On Tue, 2006-11-28 at 03:54 -0700, Andreas Dilger wrote:
> On Nov 28, 2006  05:54 +0000, Christoph Hellwig wrote:
> > What crack have you guys been smoking?
>
> As usual, Christoph is a model of diplomacy :-).
>
> > On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote:
> > > >readx/writex - scatter-gather read/write - more appropriate and
> > > >complete than the real-time extended read/write
>
> IMHO, this is a logical extension to readv/writev.  It allows a single
> readx/writex syscall to specify different targets in the file instead of
> needing separate syscalls.  So, for example, a single syscall could be
> given to dump a sparse corefile or a compiled+linked binary, allowing
> the filesystem to optimize the allocations instead of getting essentially
> random IO from several separate syscalls.
>
> > > >statlite() - asking for stat info without requiring completely
> > > >accurate info like dates and sizes. This is good
> > > >for running stat against a file that is open by hundreds of
> > > >processes, which currently forces callbacks
> > > >and the hundreds of processes to flush.
>
> This is a big win for clustered filesystems.  Some "stat" items are
> a lot more work to gather than others, and if an application (e.g.
> "ls --color", which is the default on all distros) doesn't need anything
> except the file mode to print "*" and color an executable green, it
> is a waste to gather the remaining ones.
>
> My objection to the current proposal is that it should be possible
> to _completely_ specify which fields are required and which are not,
> instead of having a split "required" and "optional" section to the
> stat data.  In some implementations, it might be desirable to only
> find the file blocks (e.g. ls -s, du -s) and not the owner, links,
> metadata, so why implement a half-baked version of a "lite" stat()?

Indeed.  It is best to be able to say what the application wants.

Have a look at the Mac OS X getattrlist() system call (note there is a
setattrlist(), too, but that is off topic w.r.t. the stat() function).
You can see the man page here, for example:

http://www.hmug.org/man/2/getattrlist.php

This interface, btw, is also made such that each file system can define
which attributes it actually supports, and a call to getattrlist() can
determine what the current file system supports.  This allows
applications to tune themselves to what the file system they are running
on supports...

I am not saying you should just copy it.  But in case you were not aware
of it, you may want to at least look at it for what others have done in
this area.

Best regards, Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply [flat|nested] 124+ messages in thread
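
For reference, a small Darwin-only sketch of the getattrlist() pattern Anton
points to, requesting just the object type; the buffer layout follows the
getattrlist(2) man page and error handling is minimal.

/* Compiles only on Mac OS X / Darwin. */
#include <sys/types.h>
#include <sys/attr.h>
#include <unistd.h>
#include <stdio.h>

struct objtype_reply {
    u_int32_t    length;        /* total size of the returned attribute data */
    fsobj_type_t obj_type;      /* VREG, VDIR, VLNK, ...                     */
} __attribute__((packed));

int main(int argc, char **argv)
{
    struct attrlist al = {
        .bitmapcount = ATTR_BIT_MAP_COUNT,
        .commonattr  = ATTR_CMN_OBJTYPE,   /* ask for exactly one attribute */
    };
    struct objtype_reply reply;

    if (argc != 2)
        return 1;
    if (getattrlist(argv[1], &al, &reply, sizeof(reply), 0) != 0) {
        perror("getattrlist");
        return 1;
    }
    printf("%s: fsobj_type %u\n", argv[1], (unsigned)reply.obj_type);
    return 0;
}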
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 20:17 ` Russell Cattelan
From: Russell Cattelan @ 2006-11-28 20:17 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Gary Grider, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3339 bytes --]

On Tue, 2006-11-28 at 03:54 -0700, Andreas Dilger wrote:
> > > >statlite() - asking for stat info without requiring completely
> > > >accurate info like dates and sizes. This is good
> > > >for running stat against a file that is open by hundreds of
> > > >processes, which currently forces callbacks
> > > >and the hundreds of processes to flush.
>
> This is a big win for clustered filesystems.  Some "stat" items are
> a lot more work to gather than others, and if an application (e.g.
> "ls --color", which is the default on all distros) doesn't need anything
> except the file mode to print "*" and color an executable green, it
> is a waste to gather the remaining ones.
>
> My objection to the current proposal is that it should be possible
> to _completely_ specify which fields are required and which are not,
> instead of having a split "required" and "optional" section to the
> stat data.  In some implementations, it might be desirable to only
> find the file blocks (e.g. ls -s, du -s) and not the owner, links,
> metadata, so why implement a half-baked version of a "lite" stat()?

I half agree with allowing for a configurable stat call.  But when it
comes to standards, and expecting everybody to adhere to them, a
well-defined list of required and optional fields seems necessary.

Otherwise every app will simply start asking for whatever fields it
*thinks* are important, without much regard for which might be expensive
to obtain cluster-wide and which ones are cheap.

By clearly defining and standardizing that list, the app programmer and
the kernel programmer both know what is expected.

The mask idea sounds like a good way to implement it, but at the same
time standard-defined masks should be created for things like color ls.

--
Russell Cattelan <cattelan@thebarn.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-28 23:28 ` Wendy Cheng
From: Wendy Cheng @ 2006-11-28 23:28 UTC (permalink / raw)
To: Gary Grider, linux-fsdevel

Andreas Dilger wrote:
>>>> statlite() - asking for stat info without requiring completely
>>>> accurate info like dates and sizes. This is good
>>>> for running stat against a file that is open by hundreds of
>>>> processes, which currently forces callbacks
>>>> and the hundreds of processes to flush.
>
> This is a big win for clustered filesystems.  Some "stat" items are
> a lot more work to gather than others, and if an application (e.g.
> "ls --color", which is the default on all distros) doesn't need anything
> except the file mode to print "*" and color an executable green, it
> is a waste to gather the remaining ones.

Some of the described calls look very exciting and, based on our current
customer issues, we need them today rather than tomorrow.  This
"statlite()" is definitely one of them, as we have been plagued by "ls"
performance for a while.  I'm wondering whether there are implementation
efforts to push this into LKML soon?

-- Wendy

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 9:12 ` Christoph Hellwig
From: Christoph Hellwig @ 2006-11-29 9:12 UTC (permalink / raw)
To: Wendy Cheng; +Cc: Gary Grider, linux-fsdevel, Ameer Armaly

On Tue, Nov 28, 2006 at 06:28:22PM -0500, Wendy Cheng wrote:
> Some of the described calls look very exciting and, based on our current
> customer issues, we need them today rather than tomorrow.  This
> "statlite()" is definitely one of them, as we have been plagued by "ls"
> performance for a while.  I'm wondering whether there are implementation
> efforts to push this into LKML soon?

Ameer Armaly started an implementation of this, but unfortunately never
posted an updated patch incorporating the review comments.  See

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2

for details.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 9:04 ` Christoph Hellwig
From: Christoph Hellwig @ 2006-11-29 9:04 UTC (permalink / raw)
To: Gary Grider; +Cc: linux-fsdevel

On Tue, Nov 28, 2006 at 05:54:28AM +0000, Christoph Hellwig wrote:
> What crack have you guys been smoking?

I'd like to apologize for this statement, it was a little harsh.  I still
think most of these APIs are rather braindead, but then again everyone
does braindead APIs now and then.

I still think it's very futile to try to force APIs on us via
standardization.  Instead of going down that route, please try to present
a case for every single API you want, including reasoning about why this
can't be fixed by speeding up existing APIs.  Note that by "us" I don't
mean just Linux but also the other open source OSes.  Unless you at least
get Linux and FreeBSD and Solaris to agree on the need for the API, it's
very pointless to go anywhere close to a standardization body.

Anyway, let's go on to the individual API groups:

 - readdirplus

   This one is completely unneeded as a kernel API.  Doing readdir plus
   calls on the wire makes a lot of sense and we already do that for
   NFSv3+.  Doing this at the syscall layer just means kernel bloat -
   syscalls are very cheap.

 - lockg

   I'm more than unhappy to add new kernel-level file locking calls.
   The whole mess of lockf vs fcntl vs leases is bad enough that we
   don't want to add more to it.  Doing some form of advisory locks
   that can be implemented in userland using a shared memory region or
   message passing might be fine.

 - openg/sutoc

   No way.  We already have a very nice file descriptor abstraction.
   You can pass file descriptors over unix sockets just fine.

 - NFSV4acls

   These have nothing to do at all with I/O performance.  They're also
   sufficiently braindead.  Even if you still want to push for it you
   shouldn't mix it up with anything else in here.

 - statlite

   The concept generally makes sense.  The specified details are however
   very wrong.  Any statlite call should operate on the normal
   OS-specified stat structure and have the mask of flags as an
   additional argument.  Because of that you can only specify
   existing POSIX stat values as mandatory, but we should have an
   informal agreement that assigns unique mask values to extensions.
   This allows applications to easily fall back to stat on operating
   systems not supporting the flags variant, and also allows new
   operating systems to implement stat using the flags variant.
   While we're at it, statlite is a really bad name for this API;
   following the *at APIs it should probably be {l,f,}statf.

 - O_LAZY

   This might make some sense.  I'd rather implement lazyio_synchronize
   and lazyio_propagate as additional arguments to posix_fadvise, though.

 - readx/writex

   Again, useless bloat.  Syscalls are cheap, and if you really want to
   submit multiple s/g I/Os at the same time and wait for all of them,
   use the Posix AIO APIs or something like Linux's io_submit.

^ permalink raw reply [flat|nested] 124+ messages in thread
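
The "pass file descriptors over unix sockets" alternative Christoph mentions
is the existing SCM_RIGHTS mechanism; a minimal sender looks roughly like
this (the receiving side uses recvmsg() with a matching control buffer).

#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Send an already-open fd to the peer on an AF_UNIX stream socket. */
static int send_fd(int sock, int fd)
{
    char dummy = '*';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;            /* ancillary payload is an fd */
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}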
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 9:14 ` Christoph Hellwig
From: Christoph Hellwig @ 2006-11-29 9:14 UTC (permalink / raw)
To: Gary Grider; +Cc: linux-fsdevel

On Wed, Nov 29, 2006 at 09:04:50AM +0000, Christoph Hellwig wrote:
> - statlite
>
>   The concept generally makes sense.  The specified details are however
>   very wrong.  Any statlite call should operate on the normal
>   OS-specified stat structure and have the mask of flags as an
>   additional argument.  Because of that you can only specify
>   existing POSIX stat values as mandatory, but we should have an
>   informal agreement that assigns unique mask values to extensions.
>   This allows applications to easily fall back to stat on operating
>   systems not supporting the flags variant, and also allows new
>   operating systems to implement stat using the flags variant.
>   While we're at it, statlite is a really bad name for this API;
>   following the *at APIs it should probably be {l,f,}statf.

Just thinking about the need to add another half a dozen syscalls for
this:  what about somehow funneling this into the flags argument of the
{f,l,}statat syscalls?

^ permalink raw reply [flat|nested] 124+ messages in thread
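
A sketch of that funneling idea, with purely hypothetical AT_STATLITE_* bits;
a current kernel would of course reject unknown flags with EINVAL.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* Hypothetical flag bits -- not present in any kernel or standard. */
#define AT_STATLITE_NOSIZE   0x10000   /* st_size/st_blocks may be stale */
#define AT_STATLITE_NOTIMES  0x20000   /* a/c/mtime may be stale         */

static int stat_name_lite(int dirfd, const char *name, struct stat *st)
{
    /* Ride the "lite" request on unused bits of the existing flags word. */
    return fstatat(dirfd, name, st,
                   AT_SYMLINK_NOFOLLOW | AT_STATLITE_NOSIZE | AT_STATLITE_NOTIMES);
}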
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 9:48 ` Andreas Dilger
From: Andreas Dilger @ 2006-11-29 9:48 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Gary Grider, linux-fsdevel

On Nov 29, 2006  09:04 +0000, Christoph Hellwig wrote:
> - readdirplus
>
>   This one is completely unneeded as a kernel API.  Doing readdir plus
>   calls on the wire makes a lot of sense and we already do that for
>   NFSv3+.  Doing this at the syscall layer just means kernel bloat -
>   syscalls are very cheap.

The question is how does the filesystem know that the application is
going to do readdir + stat every file?  It has to do this as a heuristic
implemented in the filesystem to determine if the ->getattr() calls match
the ->readdir() order.  If the application knows that it is going to be
doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem
take advantage of this information?  If combined with the statlite
interface, it can make a huge difference for clustered filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 10:18 ` Anton Altaparmakov
From: Anton Altaparmakov @ 2006-11-29 10:18 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Christoph Hellwig, Gary Grider, linux-fsdevel

On Wed, 2006-11-29 at 01:48 -0800, Andreas Dilger wrote:
> On Nov 29, 2006  09:04 +0000, Christoph Hellwig wrote:
> > - readdirplus
> >
> >   This one is completely unneeded as a kernel API.  Doing readdir plus
> >   calls on the wire makes a lot of sense and we already do that for
> >   NFSv3+.  Doing this at the syscall layer just means kernel bloat -
> >   syscalls are very cheap.
>
> The question is how does the filesystem know that the application is
> going to do readdir + stat every file?  It has to do this as a heuristic
> implemented in the filesystem to determine if the ->getattr() calls match
> the ->readdir() order.  If the application knows that it is going to be
> doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem
> take advantage of this information?  If combined with the statlite
> interface, it can make a huge difference for clustered filesystems.

The other thing is that a readdirplus, at least for some file systems,
can be implemented much more efficiently than readdir + stat because the
directory entry itself contains a lot of extra information.

To take NTFS as an example I know something about: the directory entry
caches the a/c/m times as well as the data file size (needed for "ls")
and the allocated-on-disk file size (needed for "du"), as well as the
inode number corresponding to the name, the flags of the inode
(read-only, hidden, system, whether it is a file or directory, etc.) and
some other tidbits, so readdirplus on NTFS can simply return the wanted
information without ever having to do a lookup() on the file name to
obtain the inode and then use that in the stat() system call...  The
potential decrease in work needed is tremendous in this case...

Imagine "ls -li" running with a single readdirplus() syscall, and that
being all that happens on the kernel side, too.  Not a single file name
needs to be looked up and not a single inode needs to be loaded.  I don't
think anyone can deny that that would be a massive speedup of "ls -li"
for file systems whose directory entries store extra information
compared to traditional Unix file systems...

Best regards, Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply [flat|nested] 124+ messages in thread
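
A sketch of the readdirplus() shape implied by this discussion (a dirent plus
a possibly partial stat, selected by a mask); the types and the RDP_INO bit
are illustrative assumptions, not the actual proposal text.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

#define RDP_INO  0x0001UL            /* hypothetical: only st_ino is wanted */

struct dirent_plus {
    struct dirent d_dirent;          /* the usual directory entry           */
    struct stat   d_stat;            /* attributes, per d_stat_mask         */
    unsigned long d_stat_mask;       /* which d_stat fields are valid       */
};

/* Proposed call: one trip returns the name plus (lite) attributes. */
struct dirent_plus *readdirplus(DIR *dirp, unsigned long statflags);

/* "ls -i"-style walk: a single pass, no per-name lookup or stat(). */
static void list_inodes(const char *path)
{
    DIR *d = opendir(path);
    struct dirent_plus *dp;

    if (!d)
        return;
    while ((dp = readdirplus(d, RDP_INO)) != NULL)
        printf("%lu\t%s\n", (unsigned long)dp->d_stat.st_ino,
               dp->d_dirent.d_name);
    closedir(d);
}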
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-29 8:26 ` Brad Boyer
From: Brad Boyer @ 2006-11-29 8:26 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Andreas Dilger, Christoph Hellwig, Gary Grider, linux-fsdevel

On Wed, Nov 29, 2006 at 10:18:42AM +0000, Anton Altaparmakov wrote:
> To take NTFS as an example I know something about: the directory entry
> caches the a/c/m times as well as the data file size (needed for "ls")
> and the allocated-on-disk file size (needed for "du"), as well as the
> inode number corresponding to the name, the flags of the inode
> (read-only, hidden, system, whether it is a file or directory, etc.) and
> some other tidbits, so readdirplus on NTFS can simply return the wanted
> information without ever having to do a lookup() on the file name to
> obtain the inode and then use that in the stat() system call...  The
> potential decrease in work needed is tremendous in this case...
>
> Imagine "ls -li" running with a single readdirplus() syscall, and that
> being all that happens on the kernel side, too.  Not a single file name
> needs to be looked up and not a single inode needs to be loaded.  I don't
> think anyone can deny that that would be a massive speedup of "ls -li"
> for file systems whose directory entries store extra information
> compared to traditional Unix file systems...

For a more extreme case, hfs and hfsplus don't even have a separation
between directory entries and inode information.  The code creates this
separation synthetically to match the expectations of the kernel.  During
a readdir(), the full catalog record is loaded from disk, but all that
is used is the information passed back to the filldir callback.  The only
thing that would be needed to return extra information would be code to
copy information from the internal structure to whatever the system call
used to return data to the program.

Brad Boyer
flar@allandria.com

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-30 9:25 ` Christoph Hellwig
From: Christoph Hellwig @ 2006-11-30 9:25 UTC (permalink / raw)
To: Brad Boyer
Cc: Anton Altaparmakov, Andreas Dilger, Christoph Hellwig, Gary Grider,
    linux-fsdevel

On Wed, Nov 29, 2006 at 12:26:22AM -0800, Brad Boyer wrote:
> For a more extreme case, hfs and hfsplus don't even have a separation
> between directory entries and inode information.  The code creates this
> separation synthetically to match the expectations of the kernel.  During
> a readdir(), the full catalog record is loaded from disk, but all that
> is used is the information passed back to the filldir callback.  The only
> thing that would be needed to return extra information would be code to
> copy information from the internal structure to whatever the system call
> used to return data to the program.

In this case you can in fact already instantiate inodes from readdir.
Take a look at the NFS code.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-11-30 17:49 ` Sage Weil
From: Sage Weil @ 2006-11-30 17:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider,
    linux-fsdevel

On Thu, 30 Nov 2006, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 12:26:22AM -0800, Brad Boyer wrote:
>> For a more extreme case, hfs and hfsplus don't even have a separation
>> between directory entries and inode information.  The code creates this
>> separation synthetically to match the expectations of the kernel.  During
>> a readdir(), the full catalog record is loaded from disk, but all that
>> is used is the information passed back to the filldir callback.  The only
>> thing that would be needed to return extra information would be code to
>> copy information from the internal structure to whatever the system call
>> used to return data to the program.
>
> In this case you can in fact already instantiate inodes from readdir.
> Take a look at the NFS code.

Sure.  And having readdirplus over the wire is a great performance win
for NFS, but it works only because NFS metadata consistency is already
weak.

Giving applications an atomic readdirplus makes things considerably
simpler for distributed filesystems that want to provide strong
consistency (and a reasonable interpretation of what POSIX semantics mean
for a distributed filesystem).  In particular, it allows the application
(e.g. ls --color or -al) to communicate to the kernel and filesystem that
it doesn't care about the relative ordering of each subsequent stat()
with respect to other writers (possibly on different hosts, with whom
synchronization can incur a heavy performance penalty), but rather only
wants a snapshot of dentry+inode state.

As Andreas already mentioned, detecting this (exceedingly common) case
may be possible with heuristics (e.g. watching the ordering of stat()
calls vs the filldir results), but that's hardly ideal when a cleaner
interface can explicitly capture the application's requirements.

sage

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 5:26 ` Trond Myklebust
From: Trond Myklebust @ 2006-12-01 5:26 UTC (permalink / raw)
To: Sage Weil
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Thu, 2006-11-30 at 09:49 -0800, Sage Weil wrote:
> Giving applications an atomic readdirplus makes things considerably
> simpler for distributed filesystems that want to provide strong
> consistency (and a reasonable interpretation of what POSIX semantics mean
> for a distributed filesystem).  In particular, it allows the application
> (e.g. ls --color or -al) to communicate to the kernel and filesystem that
> it doesn't care about the relative ordering of each subsequent stat()
> with respect to other writers (possibly on different hosts, with whom
> synchronization can incur a heavy performance penalty), but rather only
> wants a snapshot of dentry+inode state.

What exactly do you mean by an "atomic readdirplus"? Standard readdir is
by its very nature weakly cached, and there is no guarantee whatsoever
even that you will see all files in the directory. See the SuSv3
definition, which explicitly states that there is no ordering w.r.t.
file creation/deletion:

        The type DIR, which is defined in the <dirent.h> header,
        represents a directory stream, which is an ordered sequence of
        all the directory entries in a particular directory. Directory
        entries represent files; files may be removed from a directory
        or added to a directory asynchronously to the operation of
        readdir().

Besides, why would your application care about atomicity of the
attribute information unless you also have some form of locking to
guarantee that said information remains valid until you are done
processing it?

Cheers,
  Trond

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 7:08 ` Sage Weil
From: Sage Weil @ 2006-12-01 7:08 UTC (permalink / raw)
To: Trond Myklebust
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Fri, 1 Dec 2006, Trond Myklebust wrote:
> What exactly do you mean by an "atomic readdirplus"? Standard readdir is
> by its very nature weakly cached, and there is no guarantee whatsoever
> even that you will see all files in the directory. See the SuSv3
> definition, which explicitly states that there is no ordering w.r.t.
> file creation/deletion:
>
>         The type DIR, which is defined in the <dirent.h> header,
>         represents a directory stream, which is an ordered sequence of
>         all the directory entries in a particular directory. Directory
>         entries represent files; files may be removed from a directory
>         or added to a directory asynchronously to the operation of
>         readdir().

I mean atomic only in the sense that the stat result returned by
readdirplus() would reflect the file state at some point during the time
consumed by that system call.  In contrast, when you call stat()
separately, it's expected that the result you get back reflects the state
at some time during the stat() call, and not the readdir() that may have
preceded it.  readdir() results may be weakly cached, but stat() results
normally aren't (ignoring the usual NFS behavior for the moment).

It's the stat() part of readdir() + stat() that makes life unnecessarily
difficult for a filesystem providing strong consistency.  How can the
filesystem know that 'ls' doesn't care if the stat() results are accurate
at the time of the readdir() and not the subsequent stat()?  Something
like readdirplus() allows that to be explicitly communicated, without
resorting to heuristics or weak metadata consistency (ala NFS attribute
caching).  For distributed or network filesystems that can be a big win.
(Admittedly, there's probably little benefit for local filesystems beyond
the possibility of better prefetching, if syscalls are as cheap as
Christoph says.)

> Besides, why would your application care about atomicity of the
> attribute information unless you also have some form of locking to
> guarantee that said information remains valid until you are done
> processing it?

Something like 'ls' certainly doesn't care, but in general applications
do care that stat() results aren't cached.  They expect the stat results
to reflect the file's state at a point in time _after_ they decide to
call stat().  For example, for process A to see how much data a
just-finished process B wrote to a file...

sage

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 14:41 ` Trond Myklebust
From: Trond Myklebust @ 2006-12-01 14:41 UTC (permalink / raw)
To: Sage Weil
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Thu, 2006-11-30 at 23:08 -0800, Sage Weil wrote:
> I mean atomic only in the sense that the stat result returned by
> readdirplus() would reflect the file state at some point during the time
> consumed by that system call.  In contrast, when you call stat()
> separately, it's expected that the result you get back reflects the state
> at some time during the stat() call, and not the readdir() that may have
> preceded it.  readdir() results may be weakly cached, but stat() results
> normally aren't (ignoring the usual NFS behavior for the moment).
>
> It's the stat() part of readdir() + stat() that makes life unnecessarily
> difficult for a filesystem providing strong consistency.  How can the
> filesystem know that 'ls' doesn't care if the stat() results are accurate
> at the time of the readdir() and not the subsequent stat()?  Something
> like readdirplus() allows that to be explicitly communicated, without
> resorting to heuristics or weak metadata consistency (ala NFS attribute
> caching).  For distributed or network filesystems that can be a big win.
> (Admittedly, there's probably little benefit for local filesystems beyond
> the possibility of better prefetching, if syscalls are as cheap as
> Christoph says.)

'ls --color' and 'find' don't give a toss about most of the arguments
from 'stat()'. They just want to know what kind of filesystem object
they are dealing with. We already provide that information in the
readdir() syscall via the 'd_type' field. Adding all the other stat()
information is just going to add unnecessary synchronisation burdens.

> > Besides, why would your application care about atomicity of the
> > attribute information unless you also have some form of locking to
> > guarantee that said information remains valid until you are done
> > processing it?
>
> Something like 'ls' certainly doesn't care, but in general applications
> do care that stat() results aren't cached.  They expect the stat results
> to reflect the file's state at a point in time _after_ they decide to
> call stat().  For example, for process A to see how much data a
> just-finished process B wrote to a file...

AFAICS, it will not change any consistency semantics. The main
irritation it will introduce will be that the NFS client will suddenly
have to do things like synchronising readdirplus() and file write() in
order to provide the POSIX guarantees that you mentioned.

i.e: if someone has written data to one of the files in the directory,
then an NFS client will now have to flush that data out before calling
readdir so that the server returns the correct m/ctime or file size.
Previously, it could delay that until the stat() call.

Trond

^ permalink raw reply [flat|nested] 124+ messages in thread
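
The d_type hint Trond refers to already lets tools classify entries without
calling stat(); for example:

#define _GNU_SOURCE                  /* d_type is a BSD/glibc extension */
#include <dirent.h>
#include <stdio.h>

/* Print subdirectories using only readdir(); no stat() is needed unless
 * the filesystem reports DT_UNKNOWN. */
static void list_subdirs(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *de;

    if (!d)
        return;
    while ((de = readdir(d)) != NULL) {
        if (de->d_type == DT_DIR)
            printf("%s/\n", de->d_name);
        /* DT_UNKNOWN would require an lstat() fallback. */
    }
    closedir(d);
}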
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 16:47 ` Sage Weil
From: Sage Weil @ 2006-12-01 16:47 UTC (permalink / raw)
To: Trond Myklebust
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Fri, 1 Dec 2006, Trond Myklebust wrote:
> 'ls --color' and 'find' don't give a toss about most of the arguments
> from 'stat()'. They just want to know what kind of filesystem object
> they are dealing with. We already provide that information in the
> readdir() syscall via the 'd_type' field. Adding all the other stat()
> information is just going to add unnecessary synchronisation burdens.

'ls -al' cares about the stat() results, but does not care about the
relative timing accuracy wrt the preceding readdir().  I'm not sure why
'ls --color' still calls stat when it can get that from the readdir()
results, but either way it's asking more from the kernel/filesystem than
it needs.

>> Something like 'ls' certainly doesn't care, but in general applications
>> do care that stat() results aren't cached.  They expect the stat results
>> to reflect the file's state at a point in time _after_ they decide to
>> call stat().  For example, for process A to see how much data a
>> just-finished process B wrote to a file...
>
> AFAICS, it will not change any consistency semantics. The main
> irritation it will introduce will be that the NFS client will suddenly
> have to do things like synchronising readdirplus() and file write() in
> order to provide the POSIX guarantees that you mentioned.
>
> i.e: if someone has written data to one of the files in the directory,
> then an NFS client will now have to flush that data out before calling
> readdir so that the server returns the correct m/ctime or file size.
> Previously, it could delay that until the stat() call.

It sounds like you're talking about a single (asynchronous) client in a
directory.  In that case, the client need only flush if someone calls
readdirplus() instead of readdir(), and since readdirplus() is
effectively also a stat(), the situation isn't actually any different.

The more interesting case is multiple clients in the same directory.  In
order to provide strong consistency, both stat() and readdir() have to
talk to the server (or more complicated leasing mechanisms are needed).
In that scenario, readdirplus() is asking for _less_
synchronization/consistency of results than readdir()+stat(), not more.
i.e. both the readdir() and stat() would require a server request in
order to achieve the standard POSIX semantics, while a readdirplus()
would allow a single request.  The NFS client already provides weak
consistency of stat() results for clients.  Extending the interface
doesn't suddenly require the NFS client to provide strong consistency,
it just makes life easier for the implementation if it (or some other
filesystem) chooses to do so.

Consider two use cases.  Process A is 'ls -al', which doesn't really care
about when the size/mtime are from (i.e. sometime after opendir()).
Process B waits for a process on another host to write to a file, and
then calls stat() locally to check the result.  In order for B to get the
correct result, stat() _must_ return a value for size/mtime from _after_
the stat() was initiated.  That makes 'ls -al' slow, because it probably
has to talk to the server to make sure files haven't been modified
between the readdir() and stat().  In reality, 'ls -al' doesn't care, but
the filesystem has no way to know that without the presence of
readdirplus().  Alternatively, an NFS (or other distributed filesystem)
client can cache file attributes to make 'ls -al' fast, and simply break
process B (as NFS currently does).  readdirplus() makes it clear what
'ls -al' doesn't need, allowing the client (if it so chooses) to avoid
breaking B in the general case.  That simply isn't possible to explicitly
communicate with the existing interface.  How is that not a win?

I imagine that most of the time readdirplus() will hit something in the
VFS that simply calls readdir() and stat().  But a smart NFS (or other
network filesystem) client can opt to send a readdirplus over the wire
for readdirplus() without sacrificing stat() consistency in the general
case.

sage

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 18:07 ` Trond Myklebust
From: Trond Myklebust @ 2006-12-01 18:07 UTC (permalink / raw)
To: Sage Weil
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Fri, 2006-12-01 at 08:47 -0800, Sage Weil wrote:
> 'ls -al' cares about the stat() results, but does not care about the
> relative timing accuracy wrt the preceding readdir().  I'm not sure why
> 'ls --color' still calls stat when it can get that from the readdir()
> results, but either way it's asking more from the kernel/filesystem than
> it needs.
>
> It sounds like you're talking about a single (asynchronous) client in a
> directory.  In that case, the client need only flush if someone calls
> readdirplus() instead of readdir(), and since readdirplus() is
> effectively also a stat(), the situation isn't actually any different.
>
> The more interesting case is multiple clients in the same directory.  In
> order to provide strong consistency, both stat() and readdir() have to
> talk to the server (or more complicated leasing mechanisms are needed).

Why would that be interesting? What applications do you have that
require strong consistency in that scenario? I keep looking for uses
for strong cache consistency with no synchronisation, but I have yet to
meet someone who has an actual application that relies on it.

> In that scenario, readdirplus() is asking for _less_
> synchronization/consistency of results than readdir()+stat(), not more.
> i.e. both the readdir() and stat() would require a server request in
> order to achieve the standard POSIX semantics, while a readdirplus()
> would allow a single request.  The NFS client already provides weak
> consistency of stat() results for clients.  Extending the interface
> doesn't suddenly require the NFS client to provide strong consistency,
> it just makes life easier for the implementation if it (or some other
> filesystem) chooses to do so.

I'm quite happy with a proposal for a statlite().  I'm objecting to
readdirplus() because I can't see that it offers you anything useful.
You haven't provided an example of an application which would clearly
benefit from a readdirplus() interface instead of readdir()+statlite()
and possibly some tools for managing cache consistency.

> Consider two use cases.  Process A is 'ls -al', which doesn't really care
> about when the size/mtime are from (i.e. sometime after opendir()).
> Process B waits for a process on another host to write to a file, and
> then calls stat() locally to check the result.  In order for B to get the
> correct result, stat() _must_ return a value for size/mtime from _after_
> the stat() was initiated.  That makes 'ls -al' slow, because it probably
> has to talk to the server to make sure files haven't been modified
> between the readdir() and stat().  In reality, 'ls -al' doesn't care, but
> the filesystem has no way to know that without the presence of
> readdirplus().  Alternatively, an NFS (or other distributed filesystem)
> client can cache file attributes to make 'ls -al' fast, and simply break
> process B (as NFS currently does).  readdirplus() makes it clear what
> 'ls -al' doesn't need, allowing the client (if it so chooses) to avoid
> breaking B in the general case.  That simply isn't possible to explicitly
> communicate with the existing interface.  How is that not a win?

Using readdir() to monitor size/mtime on individual files is hardly a
case we want to optimise for. There are better tools, including
inotify() for applications that care.

I agree that an interface which allows a userland process to offer hints
to the kernel as to what kind of cache consistency it requires for file
metadata would be useful. We already have stuff like posix_fadvise() etc
for file data, and perhaps it might be worth looking into how you could
devise something similar for metadata.

If what you really want is for applications to be able to manage network
filesystem cache consistency, then why not provide those tools instead?

Cheers,
  Trond

^ permalink raw reply [flat|nested] 124+ messages in thread
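
posix_fadvise(), which Trond holds up as the model for such hints on the
data side, is used roughly like this:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* Tell the kernel a file will be read once, sequentially, and not reused.
 * posix_fadvise() returns an errno value directly rather than setting errno. */
static int advise_streaming_read(int fd, off_t len)
{
    int err = posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);

    if (err == 0)
        err = posix_fadvise(fd, 0, len, POSIX_FADV_NOREUSE);
    return err;
}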
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 18:42 ` Sage Weil
From: Sage Weil @ 2006-12-01 18:42 UTC (permalink / raw)
To: Trond Myklebust
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Fri, 1 Dec 2006, Trond Myklebust wrote:
> I'm quite happy with a proposal for a statlite().  I'm objecting to
> readdirplus() because I can't see that it offers you anything useful.
> You haven't provided an example of an application which would clearly
> benefit from a readdirplus() interface instead of readdir()+statlite()
> and possibly some tools for managing cache consistency.

Okay, now I think I understand where you're coming from.

The difference between readdirplus() and readdir()+statlite() is that
(depending on the mask you specify) statlite() either provides the
"right" answer (ala stat()), or anything that is vaguely "recent."
readdirplus() would provide size/mtime from sometime _after_ the initial
opendir() call, establishing a useful ordering.  So without
readdirplus(), you either get readdir()+stat() and the performance
problems I mentioned before, or readdir()+statlite() where "recent" may
not be good enough.

Instead of my previous example of process #1 waiting for process #2 to
finish and then checking the results with stat(), imagine instead that #1
is waiting for 100,000 other processes to finish, and then wants to check
the results (size/mtime) of all of them.  readdir()+statlite() won't
work, and readdir()+stat() may be pathologically slow.

Also, it's a tiring and trivial example, but even the 'ls -al' scenario
isn't ideally addressed by readdir()+statlite(), since statlite() might
return size/mtime from before 'ls -al' was executed by the user.  One can
easily imagine modifying a file on one host, then doing 'ls -al' on
another host and not seeing the effects.  If 'ls -al' can use
readdirplus(), its overall application semantics can be preserved without
hammering large directories in a distributed filesystem.

> I agree that an interface which allows a userland process to offer hints
> to the kernel as to what kind of cache consistency it requires for file
> metadata would be useful. We already have stuff like posix_fadvise() etc
> for file data, and perhaps it might be worth looking into how you could
> devise something similar for metadata.
> If what you really want is for applications to be able to manage network
> filesystem cache consistency, then why not provide those tools instead?

True, something to manage the attribute cache consistency for statlite()
results would also address the issue by letting an application declare
how weak its results are allowed to be.  That seems a bit more awkward,
though, and would only affect statlite()--the only call that allows weak
consistency in the first place.  In contrast, readdirplus maps nicely
onto what filesystems like NFS are already doing over the wire.

sage

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
@ 2006-12-01 19:13 ` Trond Myklebust
From: Trond Myklebust @ 2006-12-01 19:13 UTC (permalink / raw)
To: Sage Weil
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel

On Fri, 2006-12-01 at 10:42 -0800, Sage Weil wrote:
> On Fri, 1 Dec 2006, Trond Myklebust wrote:
> > I'm quite happy with a proposal for a statlite().  I'm objecting to
> > readdirplus() because I can't see that it offers you anything useful.
> > You haven't provided an example of an application which would clearly
> > benefit from a readdirplus() interface instead of readdir()+statlite()
> > and possibly some tools for managing cache consistency.
>
> Okay, now I think I understand where you're coming from.
>
> The difference between readdirplus() and readdir()+statlite() is that
> (depending on the mask you specify) statlite() either provides the
> "right" answer (ala stat()), or anything that is vaguely "recent."
> readdirplus() would provide size/mtime from sometime _after_ the initial
> opendir() call, establishing a useful ordering.  So without
> readdirplus(), you either get readdir()+stat() and the performance
> problems I mentioned before, or readdir()+statlite() where "recent" may
> not be good enough.
>
> Instead of my previous example of process #1 waiting for process #2 to
> finish and then checking the results with stat(), imagine instead that #1
> is waiting for 100,000 other processes to finish, and then wants to check
> the results (size/mtime) of all of them.  readdir()+statlite() won't
> work, and readdir()+stat() may be pathologically slow.
>
> Also, it's a tiring and trivial example, but even the 'ls -al' scenario
> isn't ideally addressed by readdir()+statlite(), since statlite() might
> return size/mtime from before 'ls -al' was executed by the user.

stat() will do the same.

> One can easily imagine modifying a file on one host, then doing 'ls -al'
> on another host and not seeing the effects.  If 'ls -al' can use
> readdirplus(), its overall application semantics can be preserved without
> hammering large directories in a distributed filesystem.

So readdirplus() would not even be cached? Yech!

> > I agree that an interface which allows a userland process to offer hints
> > to the kernel as to what kind of cache consistency it requires for file
> > metadata would be useful. We already have stuff like posix_fadvise() etc
> > for file data, and perhaps it might be worth looking into how you could
> > devise something similar for metadata.
> > If what you really want is for applications to be able to manage network
> > filesystem cache consistency, then why not provide those tools instead?
>
> True, something to manage the attribute cache consistency for statlite()
> results would also address the issue by letting an application declare
> how weak its results are allowed to be.  That seems a bit more awkward,
> though, and would only affect statlite()--the only call that allows weak
> consistency in the first place.  In contrast, readdirplus maps nicely
> onto what filesystems like NFS are already doing over the wire.

Currently, you will never get anything other than weak consistency with
NFS whether you are talking about stat(), access(), getacl(),
lseek(SEEK_END), or append().  Your 'permitting it' only in statlite() is
irrelevant to the facts on the ground: I am not changing the NFS client
caching model in any way that would affect existing applications.

Cheers,
  Trond

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-01 19:13 ` Trond Myklebust @ 2006-12-01 20:32 ` Sage Weil 0 siblings, 0 replies; 124+ messages in thread From: Sage Weil @ 2006-12-01 20:32 UTC (permalink / raw) To: Trond Myklebust Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider, linux-fsdevel On Fri, 1 Dec 2006, Trond Myklebust wrote: >> Also, it's a tiring and trivial example, but even the 'ls -al' scenario >> isn't ideally addressed by readdir()+statlite(), since statlite() might >> return size/mtime from before 'ls -al' was executed by the user. > > stat() will do the same. It does with NFS, but only because NFS doesn't follow POSIX in that regard. In general, stat() is supposed to return a value that's accurate at the time of the call. (Although now I'm confused again. If you're assuming stat() can return cached results, why do you think statlite() is useful?) > Currently, you will never get anything other than weak consistency with > NFS whether you are talking about stat(), access(), getacl(), > lseek(SEEK_END), or append(). Your 'permitting it' only in statlite() is > irrelevant to the facts on the ground: I am not changing the NFS client > caching model in any way that would affect existing applications. Clearly, if you cache attributes on the client and provide only weak consistency, then readdirplus() doesn't change much. But _other_ non-NFS filesystems may elect to provide POSIX semantics and strong consistency, even though NFS doesn't. And the interface simply doesn't allow that to be done efficiently in distributed environments, because applications can't communicate their varying consistency needs. Instead, systems like NFS weaken attribute consistency globally. That works well enough for most people most of the time, but it's hardly ideal. readdirplus() allows applications like 'ls -al' to distinguish themselves from applications that want individually accurate stat() results. That in turn allows distributed filesystems that are both strongly consistent _and_ efficient at scale. In most cases, it'll trivially turn into a readdir()+stat() in the VFS, but in some cases filesystems can exploit that information for (often enormous) performance gain, while still maintaining well-defined consistency semantics. readdir() already leaks some inode information into its result (via d_type)... I'm not sure I understand the resistance to providing more. sage ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-01 18:42 ` Sage Weil 2006-12-01 19:13 ` Trond Myklebust @ 2006-12-04 18:02 ` Peter Staubach 2006-12-05 23:20 ` readdirplus() as possible POSIX I/O API Sage Weil 1 sibling, 1 reply; 124+ messages in thread From: Peter Staubach @ 2006-12-04 18:02 UTC (permalink / raw) To: Sage Weil Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider, linux-fsdevel Sage Weil wrote: > On Fri, 1 Dec 2006, Trond Myklebust wrote: >> I'm quite happy with a proposal for a statlite(). I'm objecting to >> readdirplus() because I can't see that it offers you anything useful. >> You haven't provided an example of an application which would clearly >> benefit from a readdirplus() interface instead of readdir()+statlite() >> and possibly some tools for managing cache consistency. > > Okay, now I think I understand where you're coming from. > > The difference between readdirplus() and readdir()+statlite() is that > (depending on the mask you specify) statlite() either provides the > "right" answer (ala stat()), or anything that is vaguely "recent." > readdirplus() would provide size/mtime from sometime _after_ the > initial opendir() call, establishing a useful ordering. So without > readdirplus(), you either get readdir()+stat() and the performance > problems I mentioned before, or readdir()+statlite() where "recent" > may not be good enough. > > Instead of my previous example of proccess #1 waiting for process #2 > to finish and then checking the results with stat(), imagine instead > that #1 is waiting for 100,000 other processes to finish, and then > wants to check the results (size/mtime) of all of them. > readdir()+statlite() won't work, and readdir()+stat() may be > pathologically slow. > > Also, it's a tiring and trivial example, but even the 'ls -al' > scenario isn't ideally addressed by readdir()+statlite(), since > statlite() might return size/mtime from before 'ls -al' was executed > by the user. One can easily imagine modifying a file on one host, > then doing 'ls -al' on another host and not seeing the effects. If > 'ls -al' can use readdirplus(), it's overall application semantics can > be preserved without hammering large directories in a distributed > filesystem. > I think that there are several points which are missing here. First, readdirplus(), without any sort of caching, is going to be _very_ expensive, performance-wise, for _any_ size directory. You can see this by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS semantics. Second, the NFS client side readdirplus() implementation is going to be _very_ expensive as well. The NFS client does write-behind and all this data _must_ be flushed to the server _before_ the over the wire READDIRPLUS can be issued. This means that the client will have to step through every inode which is associated with the directory inode being readdirplus()'d and ensure that all modified data has been successfully written out. This part of the operation, for a sufficiently large directory and a sufficiently large page cache, could take significant time in itself. These overheads may make this new operation expensive enough that no applications will end up using it. >> I agree that an interface which allows a userland process offer hints to >> the kernel as to what kind of cache consistency it requires for file >> metadata would be useful. 
We already have stuff like posix_fadvise() etc >> for file data, and perhaps it might be worth looking into how you could >> devise something similar for metadata. >> If what you really want is for applications to be able to manage network >> filesystem cache consistency, then why not provide those tools instead? > > True, something to manage the attribute cache consistency for > statlite() results would also address the issue by letting an > application declare how weak it's results are allowed to be. That > seems a bit more awkward, though, and would only affect > statlite()--the only call that allows weak consistency in the first > place. In contrast, readdirplus maps nicely onto what filesystems > like NFS are already doing over the wire. Speaking of applications, how many applications are there in the world, or even being contemplated, which are interested in a directory of files and whether or not this set of files has changed from the previous snapshot of the set of files? Most applications deal with one or two files on such a basis, not multitudes. In fact, having worked with file systems and NFS in particular for more than 20 years now, I have yet to hear of one. This is a lot of work and complexity for very little gain, I think. Is this not a problem which would be better solved at the application level? Or perhaps finer granularity than "noac" for the NFS attribute caching? Thanx... ps ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-04 18:02 ` Peter Staubach @ 2006-12-05 23:20 ` Sage Weil 2006-12-06 15:48 ` Peter Staubach 0 siblings, 1 reply; 124+ messages in thread From: Sage Weil @ 2006-12-05 23:20 UTC (permalink / raw) To: Peter Staubach Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider, linux-fsdevel On Mon, 4 Dec 2006, Peter Staubach wrote: > I think that there are several points which are missing here. > > First, readdirplus(), without any sort of caching, is going to be _very_ > expensive, performance-wise, for _any_ size directory. You can see this > by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS > semantics. Are you referring to the work the server must do to gather stat information for each inode? > Second, the NFS client side readdirplus() implementation is going to be > _very_ expensive as well. The NFS client does write-behind and all this > data _must_ be flushed to the server _before_ the over the wire READDIRPLUS > can be issued. This means that the client will have to step through every > inode which is associated with the directory inode being readdirplus()'d > and ensure that all modified data has been successfully written out. This > part of the operation, for a sufficiently large directory and a sufficiently > large page cache, could take signficant time in itself. Why can't the client send the over the wire READDIRPLUS without flushing inode data, and then simply ignore the stat portion of the server's response in instances where its locally cached (and dirty) inode data is newer than the server's? > These overheads may make this new operation expensive enough that no > applications will end up using it. If the application calls readdirplus() only when it would otherwise do readdir()+stat(), the flushing you mention would happen anyway (from the stat()). Wouldn't this at least allow that to happen in parallel for the whole directory? sage ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-05 23:20 ` readdirplus() as possible POSIX I/O API Sage Weil @ 2006-12-06 15:48 ` Peter Staubach 0 siblings, 0 replies; 124+ messages in thread From: Peter Staubach @ 2006-12-06 15:48 UTC (permalink / raw) To: Sage Weil Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider, linux-fsdevel Sage Weil wrote: > On Mon, 4 Dec 2006, Peter Staubach wrote: >> I think that there are several points which are missing here. >> >> First, readdirplus(), without any sort of caching, is going to be _very_ >> expensive, performance-wise, for _any_ size directory. You can see this >> by instrumenting any NFS server which already supports the NFSv3 >> READDIRPLUS >> semantics. > > Are you referring to the work the server must do to gather stat > information for each inode? > Yes and the fact that the client will be forced to go over the wire for each readdirplus() call, whereas it can use cached information today. An application actually waiting on the response to a READDIRPLUS will not be pleased at the resulting performance. >> Second, the NFS client side readdirplus() implementation is going to be >> _very_ expensive as well. The NFS client does write-behind and all this >> data _must_ be flushed to the server _before_ the over the wire >> READDIRPLUS >> can be issued. This means that the client will have to step through >> every >> inode which is associated with the directory inode being readdirplus()'d >> and ensure that all modified data has been successfully written out. >> This >> part of the operation, for a sufficiently large directory and a >> sufficiently >> large page cache, could take signficant time in itself. > > Why can't the client send the over the wire READDIRPLUS without > flushing inode data, and then simply ignore the stat portion of the > server's response in instances where it's locally cached (and dirty) > inode data is newer than the server's? > This would seem to minimize the value as far as I understand the requirements here. >> These overheads may make this new operation expensive enough that no >> applications will end up using it. > > If the application calls readdirplus() only when it would otherwise do > readdir()+stat(), the flushing you mention would happen anyway (from > the stat()). Wouldn't this at least allow that to happen in parallel > for the whole directory? I don't see where the parallelism comes from. Before issuing the READDIRPLUS over the wire, the client would have to ensure that each and every one of those flushes was completed. I suppose that a sufficiently clever and complex implementation could figure out how to schedule all those flushes asynchronously and then wait for all of them to complete, but there will be a performance cost. Walking the caches for all of those inodes, perhaps using several or all of the cpus in the system, smacking the server with all of those WRITE operations simultaneously with all of the associated network bandwidth usage, all adds up to other applications on the client and potentially the network not doing much at the same time. All of this cost to the system and to the network for the benefit of a single application? That seems like a tough sell to me. This is an easy problem to look at from the application viewpoint. The solution seems obvious. Give it the fastest possible way to read the directory and retrieve stat information about every entry in the directory. 
However, when viewed from a systemic level, this becomes a very different problem with many more aspects. Perhaps flow controlling this one application in favor of many other applications, running network wide, may be the better thing to continue to do. I dunno. ps ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-01 18:07 ` Trond Myklebust 2006-12-01 18:42 ` Sage Weil @ 2006-12-03 1:57 ` Andreas Dilger 2006-12-03 7:34 ` Kari Hurtta 1 sibling, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-03 1:57 UTC (permalink / raw) To: Trond Myklebust Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 01, 2006 13:07 -0500, Trond Myklebust wrote: > > The more interesting case is multiple clients in the same directory. In > > order to provide strong consistency, both stat() and readdir() have to > > talk to the server (or more complicated leasing mechanisms are needed). > > Why would that be interesting? What applications do you have that > require strong consistency in that scenario? I keep looking for uses for > strong cache consistency with no synchronisation, but I have yet to meet > someone who has an actual application that relies on it. To be honest, I can't think of any use that actually _requires_ consistency from stat() or readdir(), because even if the data was valid in the kernel at the time it was gathered, there is no guarantee all the files haven't been deleted by another thread even before the syscall is complete. Any pretending that the returned data is "current" is a pipe dream. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-03 1:57 ` NFSv4/pNFS possible POSIX I/O API standards Andreas Dilger @ 2006-12-03 7:34 ` Kari Hurtta 0 siblings, 0 replies; 124+ messages in thread From: Kari Hurtta @ 2006-12-03 7:34 UTC (permalink / raw) To: linux-fsdevel Andreas Dilger <adilger@clusterfs.com> writes in gmane.linux.file-systems: > On Dec 01, 2006 13:07 -0500, Trond Myklebust wrote: > > > The more interesting case is multiple clients in the same directory. In > > > order to provide strong consistency, both stat() and readdir() have to > > > talk to the server (or more complicated leasing mechanisms are needed). > > > > Why would that be interesting? What applications do you have that > > require strong consistency in that scenario? I keep looking for uses for > > strong cache consistency with no synchronisation, but I have yet to meet > > someone who has an actual application that relies on it. > > To be honest, I can't think of any use that actually _requires_ consistency > from stat() or readdir(), because even if the data was valid in the kernel > at the time it was gathered, there is no guarantee all the files haven't > been deleted by another thread even before the syscall is complete. Any > pretending that the returned data is "current" is a pipe dream. But I can think of another kind of consistency that is assumed: all fields of stat refer to the same state and moment of the file. > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. / Kari Hurtta ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-01 14:41 ` Trond Myklebust 2006-12-01 16:47 ` Sage Weil @ 2006-12-03 1:52 ` Andreas Dilger 2006-12-03 16:10 ` Sage Weil 1 sibling, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-03 1:52 UTC (permalink / raw) To: Trond Myklebust Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 01, 2006 09:41 -0500, Trond Myklebust wrote: > 'ls --color' and 'find' don't give a toss about most of the arguments > from 'stat()'. They just want to know what kind of filesystem object > they are dealing with. We already provide that information in the > readdir() syscall via the 'd_type' field. That is _almost_ true, except that "ls --color" does a stat anyways to get the file mode (to set the "*" executable type) and the file blocks (with -s) and the size (with -l) and the inode number (with -i). In a clustered filesystem getting the inode number and mode is easily done along with the uid/gid (for many kinds of "find") while getting the file size may be non-trivial. Just to be clear, I have no desire to include any kind of "synchronization" semantics to readdirplus() that is also being discussed in this thread. Just the ability to bundle select stat info along with the readdir information, and to allow stat to not return any unnecessary info (in particular size, blocks, mtime) that may be harder to gather on a clustered filesystem. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-03 1:52 ` Andreas Dilger @ 2006-12-03 16:10 ` Sage Weil 2006-12-04 7:32 ` Andreas Dilger 0 siblings, 1 reply; 124+ messages in thread From: Sage Weil @ 2006-12-03 16:10 UTC (permalink / raw) To: Andreas Dilger Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Sat, 2 Dec 2006, Andreas Dilger wrote: > Just to be clear, I have no desire to include any kind of > "synchronization" semantics to readdirplus() that is also being discussed > in this thread. Just the ability to bundle select stat info along with > the readdir information, and to allow stat to not return any unnecessary > info (in particular size, blocks, mtime) that may be harder to gather > on a clustered filesystem. I'm not suggesting any "synchronization" beyond what opendir()/readdir() already require for the directory entries themselves. If I'm not mistaken, readdir() is required only to return directory entries as recent as the opendir() (i.e., you shouldn't see entries that were unlink()ed before you called opendir(), and intervening changes to the directory may or may not be reflected in the result, depending on how your implementation is buffering things). I would think the stat() portion of readdirplus() would be similarly (in)consistent (i.e., return a value at least as recent as the opendir()) to make life easy for the implementation and to align with existing readdir() semantics. My only concern is the "at least as recent as the opendir()" part, in contrast to statlite(), which has undefined "recentness" of its result for fields not specified in the mask. Ideally, I'd like to see readdirplus() also take a statlite() style mask, so that you can choose between either "vaguely recent" and "at least as recent as opendir()". As you mentioned, by the time you look at the result of any call (in the absence of locking) it may be out of date. But simply establishing an ordering is useful, especially in a clustered environment where some nodes are waiting for other nodes (via barriers or whatever) and then want to see the effects of previously completed fs operations. Anyway, "synchronization" semantics aside (since I appear to be somewhat alone on this :)... I'm wondering if a corresponding opendirplus() (or similar) would also be appropriate to inform the kernel/filesystem that readdirplus() will follow, and stat information should be gathered/buffered. Or do most implementations wait for the first readdir() before doing any actual work anyway? sage ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-03 16:10 ` Sage Weil @ 2006-12-04 7:32 ` Andreas Dilger 2006-12-04 15:15 ` Trond Myklebust 0 siblings, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-04 7:32 UTC (permalink / raw) To: Sage Weil Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 03, 2006 08:10 -0800, Sage Weil wrote: > My only concern is the "at least as recent as the opendir()" part, > in contrast to statlite(), which has undefined "recentness" of its > result for fields not specified in the mask. > > Ideally, I'd like to see readdirplus() also take a statlite() style mask, > so that you can choose between either "vaguely recent" and "at least as > recent as opendir()". My opinion (which may not be that of the original statlite() authors) is that the flags should really be called "valid", and any bit set in the flag can be considered valid, and any unset bit means "this field has no valid data". Having it mean "it might be out of date" gives the false impression that it might contain valid (if slightly out of date) data. > As you mentioned, by the time you look at the result of any call (in the > absence of locking) it may be out of date. But simply establishing an > ordering is useful, especially in a clustered environment where some nodes > are waiting for other nodes (via barriers or whatever) and then want to > see the effects of previously completed fs operations. Ah, OK. I didn't understand what you were getting at before. I agree that it makes sense to have the same semantics as readdir() in this regard. > I'm wondering if a corresponding opendirplus() (or similar) would also be > appropriate to inform the kernel/filesystem that readdirplus() will > follow, and stat information should be gathered/buffered. Or do most > implementations wait for the first readdir() before doing any actual work > anyway? I'm not sure what some filesystems might do here. I suppose NFS has weak enough cache semantics that it _might_ return stale cached data from the client in order to fill the readdirplus() data, but it is just as likely that it ships the whole thing to the server and returns everything in one shot. That would imply everything would be at least as up-to-date as the opendir(). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
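To make the "valid bits" reading concrete, a small sketch of how such a mask might behave. The struct, the SL_* bits, and the local-filesystem emulation are illustrative assumptions, not the draft statlite() man page.

#include <sys/stat.h>
#include <sys/types.h>

#define SL_SIZE   0x0001                /* sl_size holds valid data */
#define SL_MTIME  0x0002                /* sl_mtime holds valid data */

struct statlite {
        mode_t       sl_mode;           /* "always accurate" fields ... */
        uid_t        sl_uid;
        gid_t        sl_gid;
        unsigned int sl_litemask;       /* in: fields wanted; out: fields valid */
        off_t        sl_size;           /* meaningful only if SL_SIZE is set */
        time_t       sl_mtime;          /* meaningful only if SL_MTIME is set */
};

/* Local-filesystem emulation: every requested field is cheap, so every
 * requested bit stays set.  A cluster filesystem would instead clear the
 * bits for fields it declined to (or could not cheaply) provide, and the
 * caller must treat a cleared bit as "no valid data", never "stale data". */
static int statlite_emulated(const char *path, struct statlite *sl)
{
        struct stat st;

        if (stat(path, &st) < 0)
                return -1;
        sl->sl_mode = st.st_mode;
        sl->sl_uid  = st.st_uid;
        sl->sl_gid  = st.st_gid;
        if (sl->sl_litemask & SL_SIZE)
                sl->sl_size = st.st_size;
        if (sl->sl_litemask & SL_MTIME)
                sl->sl_mtime = st.st_mtime;
        return 0;
}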
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-04 7:32 ` Andreas Dilger @ 2006-12-04 15:15 ` Trond Myklebust 2006-12-05 0:59 ` Rob Ross 2006-12-05 10:26 ` readdirplus() as possible POSIX I/O API Andreas Dilger 0 siblings, 2 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-04 15:15 UTC (permalink / raw) To: Andreas Dilger Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote: > > I'm wondering if a corresponding opendirplus() (or similar) would also be > > appropriate to inform the kernel/filesystem that readdirplus() will > > follow, and stat information should be gathered/buffered. Or do most > > implementations wait for the first readdir() before doing any actual work > > anyway? > > I'm not sure what some filesystems might do here. I suppose NFS has weak > enough cache semantics that it _might_ return stale cached data from the > client in order to fill the readdirplus() data, but it is just as likely > that it ships the whole thing to the server and returns everything in > one shot. That would imply everything would be at least as up-to-date > as the opendir(). Whether or not the posix committee decides on readdirplus, I propose that we implement this sort of thing in the kernel via a readdir equivalent to posix_fadvise(). That can give exactly the barrier semantics that they are asking for, and only costs 1 extra syscall as opposed to 2 (opendirplus() and readdirplus()). Cheers Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
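From user space, Trond's suggestion might look something like the sketch below. The posix_diradvise() name, the advice constant, and the no-op stub are purely hypothetical; the point is only that the barrier costs one extra call and leaves the readdir()/stat() loop itself unchanged.

#include <dirent.h>

#define DIR_ADVISE_FRESH_ATTRS 1        /* "revalidate cached attributes for this
                                         * directory's entries before I stat() them" */

/* Stub standing in for the proposed call; a network filesystem would use
 * this point to flush and revalidate its attribute cache for the directory. */
static int posix_diradvise(DIR *dirp, int advice)
{
        (void)dirp;
        (void)advice;
        return 0;
}

static void scan_with_barrier(const char *path)
{
        DIR *dirp = opendir(path);
        struct dirent *de;

        if (dirp == NULL)
                return;
        posix_diradvise(dirp, DIR_ADVISE_FRESH_ATTRS);  /* the caching barrier */
        while ((de = readdir(dirp)) != NULL) {
                /* stat()/statlite() each entry exactly as before */
        }
        closedir(dirp);
}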
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-04 15:15 ` Trond Myklebust @ 2006-12-05 0:59 ` Rob Ross 2006-12-05 4:44 ` Gary Grider ` (2 more replies) 2006-12-05 10:26 ` readdirplus() as possible POSIX I/O API Andreas Dilger 1 sibling, 3 replies; 124+ messages in thread From: Rob Ross @ 2006-12-05 0:59 UTC (permalink / raw) To: Trond Myklebust Cc: Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Hi all, I don't think that the group intended that there be an opendirplus(); rather readdirplus() would simply be called instead of the usual readdir(). We should clarify that. Regarding Peter Staubach's comments about no one ever using the readdirplus() call; well, if people weren't performing this workload in the first place, we wouldn't *need* this sort of call! This call is specifically targeted at improving "ls -l" performance on large directories, and Sage has pointed out quite nicely how that might work. In our case (PVFS), we would essentially perform three phases of communication with the file system for a readdirplus that was obtaining full statistics: first grabbing the directory entries, then obtaining metadata from servers on all objects in bulk, then gathering file sizes in bulk. The reduction in control message traffic is enormous, and the concurrency is much greater than in a readdir()+stat()s workload. We'd never perform this sort of optimization optimistically, as the cost of guessing wrong is just too high. We would want to see the call as a proper VFS operation that we could act upon. The entire readdirplus() operation wasn't intended to be atomic, and in fact the returned structure has space for an error associated with the stat() on a particular entry, to allow for implementations that stat() subsequently and get an error because the object was removed between when the entry was read out of the directory and when the stat was performed. I think this fits well with what Andreas and others are thinking. We should clarify the description appropriately. I don't think that we have a readdirpluslite() variation documented yet? Gary? It would make a lot of sense. Except that it should probably have a better name... Regarding Andreas's note that he would prefer the statlite() flags to mean "valid", that makes good sense to me (and would obviously apply to the so-far even more hypothetical readdirpluslite()). I don't think there's a lot of value in returning possibly-inaccurate values? Thanks everyone, Rob Trond Myklebust wrote: > On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote: >>> I'm wondering if a corresponding opendirplus() (or similar) would also be >>> appropriate to inform the kernel/filesystem that readdirplus() will >>> follow, and stat information should be gathered/buffered. Or do most >>> implementations wait for the first readdir() before doing any actual work >>> anyway? >> I'm not sure what some filesystems might do here. I suppose NFS has weak >> enough cache semantics that it _might_ return stale cached data from the >> client in order to fill the readdirplus() data, but it is just as likely >> that it ships the whole thing to the server and returns everything in >> one shot. That would imply everything would be at least as up-to-date >> as the opendir(). > > Whether or not the posix committee decides on readdirplus, I propose > that we implement this sort of thing in the kernel via a readdir > equivalent to posix_fadvise(). 
That can give exactly the barrier > semantics that they are asking for, and only costs 1 extra syscall as > opposed to 2 (opendirplus() and readdirplus()). > > Cheers > Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 0:59 ` Rob Ross @ 2006-12-05 4:44 ` Gary Grider 2006-12-05 10:05 ` Christoph Hellwig 0 siblings, 1 reply; 124+ messages in thread From: Gary Grider @ 2006-12-05 4:44 UTC (permalink / raw) To: Rob Ross, Trond Myklebust Cc: Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, linux-fsdevel At 05:59 PM 12/4/2006, Rob Ross wrote: >Hi all, > >I don't think that the group intended that there be an >opendirplus(); rather readdirplus() would simply be called instead >of the usual readdir(). We should clarify that. > >Regarding Peter Staubach's comments about no one ever using the >readdirplus() call; well, if people weren't performing this workload >in the first place, we wouldn't *need* this sort of call! This call >is specifically targeted at improving "ls -l" performance on large >directories, and Sage has pointed out quite nicely how that might work. > >In our case (PVFS), we would essentially perform three phases of >communication with the file system for a readdirplus that was >obtaining full statistics: first grabbing the directory entries, >then obtaining metadata from servers on all objects in bulk, then >gathering file sizes in bulk. The reduction in control message >traffic is enormous, and the concurrency is much greater than in a >readdir()+stat()s workload. We'd never perform this sort of >optimization optimistically, as the cost of guessing wrong is just >too high. We would want to see the call as a proper VFS operation >that we could act upon. > >The entire readdirplus() operation wasn't intended to be atomic, and >in fact the returned structure has space for an error associated >with the stat() on a particular entry, to allow for implementations >that stat() subsequently and get an error because the object was >removed between when the entry was read out of the directory and >when the stat was performed. I think this fits well with what >Andreas and others are thinking. We should clarify the description >appropriately. > >I don't think that we have a readdirpluslite() variation documented >yet? Gary? It would make a lot of sense. Except that it should >probably have a better name... Correct, we do not have that documented. I suppose we could just have a mask like statlite and keep it to one call perhaps. >Regarding Andreas's note that he would prefer the statlite() flags >to mean "valid", that makes good sense to me (and would obviously >apply to the so-far even more hypothetical readdirpluslite()). I >don't think there's a lot of value in returning possibly-inaccurate values? The one use that some users talk about is just knowing the file is growing is important and useful to them, knowing exactly to the byte how much growth seems less important to them until they close. On these big parallel apps, so many things can happen that can just hang. They often use the presence of checkpoint files and how big they are to gauge progress of the application. Of course there are other ways this can be accomplished but they do this sort of thing a lot. That is the main case I have heard that might benefit from "possibly-inaccurate" values. Of course it assumes that the inaccuracy is just old information and not bogus information. Thanks, we will put out a complete version of what we have in a document to the Open Group site in a week or two so all the pages in their current state are available. 
We could then begin some iteration on all these comments we have gotten from the various communities. Thanks Gary >Thanks everyone, > >Rob > >Trond Myklebust wrote: >>On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote: >>>>I'm wondering if a corresponding opendirplus() (or similar) would >>>>also be appropriate to inform the kernel/filesystem that >>>>readdirplus() will follow, and stat information should be >>>>gathered/buffered. Or do most implementations wait for the first >>>>readdir() before doing any actual work anyway? >>>I'm not sure what some filesystems might do here. I suppose NFS has weak >>>enough cache semantics that it _might_ return stale cached data from the >>>client in order to fill the readdirplus() data, but it is just as likely >>>that it ships the whole thing to the server and returns everything in >>>one shot. That would imply everything would be at least as up-to-date >>>as the opendir(). >>Whether or not the posix committee decides on readdirplus, I propose >>that we implement this sort of thing in the kernel via a readdir >>equivalent to posix_fadvise(). That can give exactly the barrier >>semantics that they are asking for, and only costs 1 extra syscall as >>opposed to 2 (opendirplus() and readdirplus()). >>Cheers >> Trond > ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 4:44 ` Gary Grider @ 2006-12-05 10:05 ` Christoph Hellwig 0 siblings, 0 replies; 124+ messages in thread From: Christoph Hellwig @ 2006-12-05 10:05 UTC (permalink / raw) To: Gary Grider Cc: Rob Ross, Trond Myklebust, Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, linux-fsdevel On Mon, Dec 04, 2006 at 09:44:08PM -0700, Gary Grider wrote: > The one use that some users talk about is just knowing the file is > growing is important and useful to them, > knowing exactly to the byte how much growth seems less important to > them until they close. > On these big parallel apps, so many things can happen that can just > hang. They often use > the presence of checkpoint files and how big they are to gage > progress of he application. > Of course there are other ways this can be accomplished but they do > this sort of thing > a lot. That is the main case I have heard that might benefit from > "possibly-inaccurate" values. > Of course it assumes that the inaccuracy is just old information and > not bogus information. "There are better ways to do it but we refuse to do it right" is hardly a justification for adding kernel bloat. > Thanks, we will put out a complete version of what we have in a > document to the Open Group > site in a week or two so all the pages in their current state are > available. We could then > begin some iteration on all these comments we have gotten from the > various communities. Could you please stop putting out specs until you actually have working code? There's absolutely no point in standardizing things until it's actually used in practice. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 0:59 ` Rob Ross 2006-12-05 4:44 ` Gary Grider @ 2006-12-05 5:56 ` Trond Myklebust 2006-12-05 10:07 ` Christoph Hellwig 2006-12-05 14:37 ` Peter Staubach 2 siblings, 1 reply; 124+ messages in thread From: Trond Myklebust @ 2006-12-05 5:56 UTC (permalink / raw) To: Rob Ross Cc: Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Mon, 2006-12-04 at 18:59 -0600, Rob Ross wrote: > Hi all, > > I don't think that the group intended that there be an opendirplus(); > rather readdirplus() would simply be called instead of the usual > readdir(). We should clarify that. > > Regarding Peter Staubach's comments about no one ever using the > readdirplus() call; well, if people weren't performing this workload in > the first place, we wouldn't *need* this sort of call! This call is > specifically targeted at improving "ls -l" performance on large > directories, and Sage has pointed out quite nicely how that might work. ...and we have pointed out how nicely this ignores the realities of current caching models. There is no need for a readdirplus() system call. There may be a need for a caching barrier, but AFAICS that is all. Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 5:56 ` Trond Myklebust @ 2006-12-05 10:07 ` Christoph Hellwig 2006-12-05 14:20 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 124+ messages in thread From: Christoph Hellwig @ 2006-12-05 10:07 UTC (permalink / raw) To: Trond Myklebust Cc: Rob Ross, Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 12:56:40AM -0500, Trond Myklebust wrote: > On Mon, 2006-12-04 at 18:59 -0600, Rob Ross wrote: > > Hi all, > > > > I don't think that the group intended that there be an opendirplus(); > > rather readdirplus() would simply be called instead of the usual > > readdir(). We should clarify that. > > > > Regarding Peter Staubach's comments about no one ever using the > > readdirplus() call; well, if people weren't performing this workload in > > the first place, we wouldn't *need* this sort of call! This call is > > specifically targeted at improving "ls -l" performance on large > > directories, and Sage has pointed out quite nicely how that might work. > > ...and we have pointed out how nicely this ignores the realities of > current caching models. There is no need for a readdirplus() system > call. There may be a need for a caching barrier, but AFAICS that is all. I think Andreas mentioned that it is useful for clustered filesystems that can avoid additional roundtrips this way. That alone might not be enough reason for API additions, though. Then again, statlite and readdirplus really are the most sane bits of these proposals as they fit nicely into the existing set of APIs. The filehandle idiocy on the other hand is way off into crackpipe land. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 10:07 ` Christoph Hellwig @ 2006-12-05 14:20 ` Matthew Wilcox 2006-12-06 15:04 ` Rob Ross 2006-12-05 14:55 ` Trond Myklebust ` (2 subsequent siblings) 3 siblings, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-12-05 14:20 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 10:07:48AM +0000, Christoph Hellwig wrote: > The filehandle idiocy on the other hand is way of into crackpipe land. Right, and it needs to be discarded. Of course, there was a real problem that it addressed, so we need to come up with an acceptable alternative. The scenario is a cluster-wide application doing simultaneous opens of the same file. So thousands of nodes all hitting the same DLM locks (for read) all at once. The openg() non-solution implies that all nodes in the cluster share the same filehandle space, so I think a reasonable solution can be implemented entirely within the clusterfs, with an extra flag to open(), say O_CLUSTER_WIDE. When the clusterfs sees this flag set (in ->lookup), it can treat it as a hint that this pathname component is likely to be opened again on other nodes and broadcast that fact to the other nodes within the cluster. Other nodes on seeing that hint (which could be structured as "The child "bin" of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact. When they see their own open, they can populate the path to that file without asking the server for extra metadata. There's obviously security issues there (why I say 'hint' rather than 'command'), but there's also security problems with open-by-filehandle. Note that this solution requires no syscall changes, no application changes, and also helps a scenario where each node opens a different file in the same directory. I've never worked on a clusterfs, so there may be some gotchas (eg, how do you invalidate the caches of nodes when you do a rename). But this has to be preferable to open-by-fh. ^ permalink raw reply [flat|nested] 124+ messages in thread
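From the application side the suggestion above is just a flag bit; the value used for O_CLUSTER_WIDE below is a placeholder, not an allocated flag, and the broadcast behaviour it hints at would live entirely inside the cluster filesystem.

#include <fcntl.h>

#ifndef O_CLUSTER_WIDE
#define O_CLUSTER_WIDE  04000000        /* placeholder bit, purely illustrative */
#endif

static int open_shared_input(const char *path)
{
        /* The flag is only a hint: the filesystem may pre-broadcast the lookup
         * result for this path to other nodes, but the open itself behaves
         * exactly like a plain open(). */
        return open(path, O_RDONLY | O_CLUSTER_WIDE);
}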
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 14:20 ` Matthew Wilcox @ 2006-12-06 15:04 ` Rob Ross 2006-12-06 15:44 ` Matthew Wilcox 0 siblings, 1 reply; 124+ messages in thread From: Rob Ross @ 2006-12-06 15:04 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Matthew Wilcox wrote: > On Tue, Dec 05, 2006 at 10:07:48AM +0000, Christoph Hellwig wrote: >> The filehandle idiocy on the other hand is way of into crackpipe land. > > Right, and it needs to be discarded. Of course, there was a real > problem that it addressed, so we need to come up with an acceptable > alternative. > > The scenario is a cluster-wide application doing simultaneous opens of > the same file. So thousands of nodes all hitting the same DLM locks > (for read) all at once. The openg() non-solution implies that all > nodes in the cluster share the same filehandle space, so I think a > reasonable solution can be implemented entirely within the clusterfs, > with an extra flag to open(), say O_CLUSTER_WIDE. When the clusterfs > sees this flag set (in ->lookup), it can treat it as a hint that this > pathname component is likely to be opened again on other nodes and > broadcast that fact to the other nodes within the cluster. Other nodes > on seeing that hint (which could be structured as "The child "bin" > of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle > ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact. > When they see their own open, they can populate the path to that file > without asking the server for extra metadata. > > There's obviously security issues there (why I say 'hint' rather than > 'command'), but there's also security problems with open-by-filehandle. > Note that this solution requires no syscall changes, no application > changes, and also helps a scenario where each node opens a different > file in the same directory. > > I've never worked on a clusterfs, so there may be some gotchas (eg, how > do you invalidate the caches of nodes when you do a rename). But this > has to be preferable to open-by-fh. The openg() solution has the following advantages to what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients. In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be). We could work around this by having one application open early, then barrier, then have everyone else open, but then we might as well have just sent the handle as the barrier operation, and we've made the use of the O_CLUSTER_WIDE open() significantly more complicated for the application. However, the application change issue is actually moot; we will make whatever changes inside our MPI-IO implementation, and many users will get the benefits for free. The readdirplus(), readx()/writex(), and openg()/openfh() were all designed to allow our applications to explain exactly what they wanted and to allow for explicit communication. 
I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants. Regards, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
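A rough sketch of the openg()/openfh() pattern described above as it might appear inside an MPI-IO library. The fh_t definition, FH_MAX_SIZE, and both prototypes are assumptions for illustration; the draft man pages define the real types, and error handling is omitted here.

#include <fcntl.h>
#include <mpi.h>

#define FH_MAX_SIZE 128                           /* assumed bound on handle size */
typedef struct { char data[FH_MAX_SIZE]; } fh_t;  /* opaque cluster-wide handle */

int openg(const char *path, int flags, fh_t *handle);   /* hypothetical */
int openfh(const fh_t *handle);                         /* hypothetical */

/* One rank resolves the path and passes the permission check; the application
 * (not the filesystem) moves the handle around; every rank then turns the
 * handle into an ordinary file descriptor without a path traversal. */
static int open_on_all_ranks(const char *path, MPI_Comm comm)
{
        fh_t handle;
        int rank;

        MPI_Comm_rank(comm, &rank);
        if (rank == 0 && openg(path, O_RDONLY, &handle) < 0)
                MPI_Abort(comm, 1);

        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, comm);

        return openfh(&handle);         /* usable like any fd from open() */
}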
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 15:04 ` Rob Ross @ 2006-12-06 15:44 ` Matthew Wilcox 2006-12-06 16:15 ` Rob Ross 0 siblings, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-12-06 15:44 UTC (permalink / raw) To: Rob Ross Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: > The openg() solution has the following advantages to what you propose. > First, it places the burden of the communication of the file handle on > the application process, not the file system. That means less work for > the file system. Second, it does not require that clients respond to > unexpected network traffic. Third, the network traffic is deterministic > -- one client interacts with the file system and then explicitly > performs the broadcast. Fourth, it does not require that the file system > store additional state on clients. You didn't address the disadvantages I pointed out on December 1st in a mail to the posix mailing list: : I now understand this not so much as a replacement for dup() but in : terms of being able to open by NFS filehandle, or inode number. The : fh_t is presumably generated by the underlying cluster filesystem, and : is a handle that has meaning on all nodes that are members of the : cluster. : : I think we need to consider security issues (that have also come up : when open-by-inode-number was proposed). For example, how long is the : fh_t intended to be valid for? Forever? Until the cluster is rebooted? : Could the fh_t be used by any user, or only those with credentials to : access the file? What happens if we revoke() the original fd? : : I'm a little concerned about the generation of a suitable fh_t. : In the implementation of sutoc(), how does the kernel know which : filesystem to ask to translate it? It's not impossible (though it is : implausible) that an fh_t could be meaningful to more than one : filesystem. : : One possibility of fixing this could be to use a magic number at the : beginning of the fh_t to distinguish which filesystem this belongs : to (a list of currently-used magic numbers in Linux can be found at : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) Christoph has also touched on some of these points, and added some I missed. > In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing > the flag) would likely cause a storm of network traffic if clients were > closely synchronized (which they are likely to be). I think you're referring to a naive application, rather than a naive cluster filesystem, right? There's several ways to fix that problem, including throttling broadcasts of information, having nodes ask their immediate neighbours if they have a cache of the information, and having the server not respond (wait for a retransmit) if it's recently sent out a broadcast. > However, the application change issue is actually moot; we will make > whatever changes inside our MPI-IO implementation, and many users will > get the benefits for free. That's good. > The readdirplus(), readx()/writex(), and openg()/openfh() were all > designed to allow our applications to explain exactly what they wanted > and to allow for explicit communication. I understand that there is a > tendency toward solutions where the FS guesses what the app is going to > do or is passed a hint (e.g. 
fadvise) about what is going to happen, > because these things don't require interface changes. But these > solutions just aren't as effective as actually spelling out what the > application wants. Sure, but I think you're emphasising "these interfaces let us get our job done" over the legitimate concerns that we have. I haven't really looked at the readdirplus() or readx()/writex() interfaces, but the security problems with openg() make me think you haven't really looked at it from the "what could go wrong" perspective. I'd be interested in reviewing the readx()/writex() interfaces, but still don't see a document for them anywhere. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 15:44 ` Matthew Wilcox @ 2006-12-06 16:15 ` Rob Ross 0 siblings, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-06 16:15 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Matthew Wilcox wrote: > On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote: >> The openg() solution has the following advantages to what you propose. >> First, it places the burden of the communication of the file handle on >> the application process, not the file system. That means less work for >> the file system. Second, it does not require that clients respond to >> unexpected network traffic. Third, the network traffic is deterministic >> -- one client interacts with the file system and then explicitly >> performs the broadcast. Fourth, it does not require that the file system >> store additional state on clients. > > You didn't address the disadvantages I pointed out on December 1st in a > mail to the posix mailing list: I coincidentally just wrote about some of this in another email. Wasn't trying to avoid you... > : I now understand this not so much as a replacement for dup() but in > : terms of being able to open by NFS filehandle, or inode number. The > : fh_t is presumably generated by the underlying cluster filesystem, and > : is a handle that has meaning on all nodes that are members of the > : cluster. Exactly. > : I think we need to consider security issues (that have also come up > : when open-by-inode-number was proposed). For example, how long is the > : fh_t intended to be valid for? Forever? Until the cluster is rebooted? > : Could the fh_t be used by any user, or only those with credentials to > : access the file? What happens if we revoke() the original fd? The fh_t would be validated either (a) when the openfh() is called, or on accesses using the associated capability. As Christoph pointed out, this really is a capability and encapsulates everything necessary for a particular user to access a particular file. It can be handed to others, and in fact that is a critical feature for our use case. After the openfh(), the access model is identical to a previously open()ed file. So the question is what happens between the openg() and the openfh(). Our intention was to allow servers to "forget" these fh_ts at will. So a revoke between openg() and openfh() would kill the fh_t, and the subsequent openfh() would fail, or subsequent accesses would fail (depending on when the FS chose to validate). Does this help? > : I'm a little concerned about the generation of a suitable fh_t. > : In the implementation of sutoc(), how does the kernel know which > : filesystem to ask to translate it? It's not impossible (though it is > : implausible) that an fh_t could be meaningful to more than one > : filesystem. > : > : One possibility of fixing this could be to use a magic number at the > : beginning of the fh_t to distinguish which filesystem this belongs > : to (a list of currently-used magic numbers in Linux can be found at > : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h) > > Christoph has also touched on some of these points, and added some I > missed. We could use advice on this point. Certainly it's possible to encode information about the FS from which the fh_t originated, but we haven't tried to spell out exactly how that would happen. Your approach described here sounds good to me. 
>> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing >> the flag) would likely cause a storm of network traffic if clients were >> closely synchronized (which they are likely to be). > > I think you're referring to a naive application, rather than a naive > cluster filesystem, right? There's several ways to fix that problem, > including throttling broadcasts of information, having nodes ask their > immediate neighbours if they have a cache of the information, and having > the server not respond (wait for a retransmit) if it's recently sent out > a broadcast. Yes, naive application. You're right that the file system could adapt to this, but on the other hand if we were explicitly passing the fh_t in user space, we could just use MPI_Bcast and be done with it, with an algorithm that is well-matched to the system, etc. >> However, the application change issue is actually moot; we will make >> whatever changes inside our MPI-IO implementation, and many users will >> get the benefits for free. > > That's good. Absolutely. Same goes for readx()/writex() also, BTW, at least for MPI-IO users. We will build the input parameters inside MPI-IO using existing information from users, rather than applying data sieving or using multiple POSIX calls. >> The readdirplus(), readx()/writex(), and openg()/openfh() were all >> designed to allow our applications to explain exactly what they wanted >> and to allow for explicit communication. I understand that there is a >> tendency toward solutions where the FS guesses what the app is going to >> do or is passed a hint (e.g. fadvise) about what is going to happen, >> because these things don't require interface changes. But these >> solutions just aren't as effective as actually spelling out what the >> application wants. > > Sure, but I think you're emphasising "these interfaces let us get our > job done" over the legitimate concerns that we have. I haven't really > looked at the readdirplus() or readx()/writex() interfaces, but the > security problems with openg() makes me think you haven't really looked > at it from the "what could go wrong" perspective. I'm sorry if it seems like I'm ignoring your concerns; that isn't my intention. I am advocating the calls though, because the whole point in getting into these discussions is to improve the state of things for these access patterns. Part of the problem is that the descriptions of these calls were written for inclusion in a POSIX document and not for discussion on this list. Those descriptions don't usually include detailed descriptions of implementation options or use cases. We should have created some additional documentation before coming to this list, but what is done is done. In the case of openg(), the major approach to things "going wrong" is for the server to just forget it ever handed out the fh_t and make the application figure it out. We think that makes implementations relatively simple, because we don't require so much. It makes using this capability a little more difficult outside the kernel, but we're prepared for that. > I'd be interested in > reviewing the readx()/writex() interfaces, but still don't see a document > for them anywhere. Really? Ack! Ok. I'll talk with the others and get a readx()/writex() page up soon, although it would be nice to let the discussion of these few calm down a bit before we start with those...I'm not getting much done at work right now :). Thanks for the discussion, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
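One way the magic-number suggestion could shape the handle is a self-describing layout, so the kernel can route an openfh() to the filesystem that minted it. The struct below is purely illustrative; the field sizes and contents are assumptions, not part of the proposal.

#include <stdint.h>

struct fh_wire {
        uint32_t      fh_magic;         /* owning filesystem's magic number, so
                                         * openfh() knows who decodes the rest */
        uint32_t      fh_len;           /* bytes of fh_data actually used */
        unsigned char fh_data[120];     /* filesystem-private capability blob:
                                         * object id, generation, credential or
                                         * expiry info, integrity check, ... */
};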
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 10:07 ` Christoph Hellwig 2006-12-05 14:20 ` Matthew Wilcox @ 2006-12-05 14:55 ` Trond Myklebust 2006-12-05 22:11 ` Rob Ross 2006-12-06 12:22 ` Ragnar Kjørstad 2006-12-05 16:55 ` Latchesar Ionkov 2006-12-05 21:50 ` Rob Ross 3 siblings, 2 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-05 14:55 UTC (permalink / raw) To: Christoph Hellwig Cc: Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, 2006-12-05 at 10:07 +0000, Christoph Hellwig wrote: > > ...and we have pointed out how nicely this ignores the realities of > > current caching models. There is no need for a readdirplus() system > > call. There may be a need for a caching barrier, but AFAICS that is all. > > I think Andreas mentioned that it is useful for clustered filesystems > that can avoid additional roundtrips this way. That alone might now > be enough reason for API additions, though. The again statlite and > readdirplus really are the most sane bits of these proposals as they > fit nicely into the existing set of APIs. The filehandle idiocy on > the other hand is way of into crackpipe land. They provide no benefits whatsoever for the two most commonly used networked filesystems NFS and CIFS. As far as they are concerned, the only new thing added by readdirplus() is the caching barrier semantics. I don't see why you would want to add that into a generic syscall like readdir() though: it is a) networked filesystem specific. The mask stuff etc adds no value whatsoever to actual "posix" filesystems. In fact it is telling the kernel that it can violate posix semantics. b) quite unnatural to impose caching semantics on all the directory _entries_ using a syscall that refers to the directory itself (see the explanations by both myself and Peter Staubach of the synchronisation difficulties). Consider in particular that it is quite possible for directory contents to change in between readdirplus calls. i.e. the "strict posix caching model' is pretty much impossible to implement on something like NFS or CIFS using these semantics. Why then even bother to have "masks" to tell you when it is OK to violate said strict model. c) Says nothing about what should happen to non-stat() metadata such as ACL information and other extended attributes (for example future selinux context info). You would think that the 'ls -l' application would care about this. Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 14:55 ` Trond Myklebust @ 2006-12-05 22:11 ` Rob Ross 2006-12-05 23:24 ` Trond Myklebust 2006-12-06 12:22 ` Ragnar Kjørstad 1 sibling, 1 reply; 124+ messages in thread From: Rob Ross @ 2006-12-05 22:11 UTC (permalink / raw) To: Trond Myklebust Cc: Christoph Hellwig, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Trond Myklebust wrote: > On Tue, 2006-12-05 at 10:07 +0000, Christoph Hellwig wrote: >>> ...and we have pointed out how nicely this ignores the realities of >>> current caching models. There is no need for a readdirplus() system >>> call. There may be a need for a caching barrier, but AFAICS that is all. >> I think Andreas mentioned that it is useful for clustered filesystems >> that can avoid additional roundtrips this way. That alone might now >> be enough reason for API additions, though. The again statlite and >> readdirplus really are the most sane bits of these proposals as they >> fit nicely into the existing set of APIs. The filehandle idiocy on >> the other hand is way of into crackpipe land. > > They provide no benefits whatsoever for the two most commonly used > networked filesystems NFS and CIFS. As far as they are concerned, the > only new thing added by readdirplus() is the caching barrier semantics. > I don't see why you would want to add that into a generic syscall like > readdir() though: it is > > a) networked filesystem specific. The mask stuff etc adds no > value whatsoever to actual "posix" filesystems. In fact it is > telling the kernel that it can violate posix semantics. It isn't violating POSIX semantics if we get the calls passed as an extension to POSIX :). > b) quite unnatural to impose caching semantics on all the > directory _entries_ using a syscall that refers to the directory > itself (see the explanations by both myself and Peter Staubach > of the synchronisation difficulties). Consider in particular > that it is quite possible for directory contents to change in > between readdirplus calls. I want to make sure that I understand this correctly. NFS semantics dictate that if someone stat()s a file that all changes from that client need to be propagated to the server? And this call complicates that semantic because now there's an operation on a different object (the directory) that would cause this flush on the files? Of course directory contents can change in between readdirplus() calls, just as they can between readdir() calls. That's expected, and we do not attempt to create consistency between calls. > i.e. the "strict posix caching model' is pretty much impossible > to implement on something like NFS or CIFS using these > semantics. Why then even bother to have "masks" to tell you when > it is OK to violate said strict model. We're trying to obtain improved performance for distributed file systems with stronger consistency guarantees than these two. > c) Says nothing about what should happen to non-stat() metadata > such as ACL information and other extended attributes (for > example future selinux context info). You would think that the > 'ls -l' application would care about this. Honestly, we hadn't thought about other non-stat() metadata because we didn't think it was part of the use case, and we were trying to stay close to the flavor of POSIX. If you have ideas here, we'd like to hear them. Thanks for the comments, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 22:11 ` Rob Ross @ 2006-12-05 23:24 ` Trond Myklebust 2006-12-06 16:42 ` Rob Ross 0 siblings, 1 reply; 124+ messages in thread From: Trond Myklebust @ 2006-12-05 23:24 UTC (permalink / raw) To: Rob Ross Cc: Christoph Hellwig, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, 2006-12-05 at 16:11 -0600, Rob Ross wrote: > Trond Myklebust wrote: > > b) quite unnatural to impose caching semantics on all the > > directory _entries_ using a syscall that refers to the directory > > itself (see the explanations by both myself and Peter Staubach > > of the synchronisation difficulties). Consider in particular > > that it is quite possible for directory contents to change in > > between readdirplus calls. > > I want to make sure that I understand this correctly. NFS semantics > dictate that if someone stat()s a file that all changes from that client > need to be propagated to the server? And this call complicates that > semantic because now there's an operation on a different object (the > directory) that would cause this flush on the files? The only way for an NFS client to obey the POSIX requirement that write() immediately updates the mtime/ctime is to flush out all cached writes whenever the user requests stat() information. > > i.e. the "strict posix caching model' is pretty much impossible > > to implement on something like NFS or CIFS using these > > semantics. Why then even bother to have "masks" to tell you when > > it is OK to violate said strict model. > > We're trying to obtain improved performance for distributed file systems > with stronger consistency guarantees than these two. So you're saying I should ignore this thread. Fine... > > c) Says nothing about what should happen to non-stat() metadata > > such as ACL information and other extended attributes (for > > example future selinux context info). You would think that the > > 'ls -l' application would care about this. > > Honestly, we hadn't thought about other non-stat() metadata because we > didn't think it was part of the use case, and we were trying to stay > close to the flavor of POSIX. If you have ideas here, we'd like to hear > them. See my previous postings. Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 23:24 ` Trond Myklebust @ 2006-12-06 16:42 ` Rob Ross 0 siblings, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-06 16:42 UTC (permalink / raw) To: Trond Myklebust Cc: Christoph Hellwig, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Trond Myklebust wrote: > On Tue, 2006-12-05 at 16:11 -0600, Rob Ross wrote: >> Trond Myklebust wrote: >>> b) quite unnatural to impose caching semantics on all the >>> directory _entries_ using a syscall that refers to the directory >>> itself (see the explanations by both myself and Peter Staubach >>> of the synchronisation difficulties). Consider in particular >>> that it is quite possible for directory contents to change in >>> between readdirplus calls. >> I want to make sure that I understand this correctly. NFS semantics >> dictate that if someone stat()s a file that all changes from that client >> need to be propagated to the server? And this call complicates that >> semantic because now there's an operation on a different object (the >> directory) that would cause this flush on the files? > > The only way for an NFS client to obey the POSIX requirement that > write() immediately updates the mtime/ctime is to flush out all cached > writes whenever the user requests stat() information. Thanks for explaining this. I've never understood how it is decided where the line is drawn with respect to where NFS does obey POSIX semantics for a particular implementation. Writes on other nodes wouldn't necessarily have updated mtime/ctime, right? >>> i.e. the "strict posix caching model' is pretty much impossible >>> to implement on something like NFS or CIFS using these >>> semantics. Why then even bother to have "masks" to tell you when >>> it is OK to violate said strict model. >> We're trying to obtain improved performance for distributed file systems >> with stronger consistency guarantees than these two. > > So you're saying I should ignore this thread. Fine... No I'm trying to explain when the calls might be useful. But if you are only interested in NFS and CIFS, then I guess the thread might not be very interesting. >>> c) Says nothing about what should happen to non-stat() metadata >>> such as ACL information and other extended attributes (for >>> example future selinux context info). You would think that the >>> 'ls -l' application would care about this. >> Honestly, we hadn't thought about other non-stat() metadata because we >> didn't think it was part of the use case, and we were trying to stay >> close to the flavor of POSIX. If you have ideas here, we'd like to hear >> them. > > See my previous postings. I'll do that. Thanks. Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 14:55 ` Trond Myklebust 2006-12-05 22:11 ` Rob Ross @ 2006-12-06 12:22 ` Ragnar Kjørstad 2006-12-06 15:14 ` Trond Myklebust 1 sibling, 1 reply; 124+ messages in thread From: Ragnar Kjørstad @ 2006-12-06 12:22 UTC (permalink / raw) To: Trond Myklebust Cc: Christoph Hellwig, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote: > > The again statlite and > > readdirplus really are the most sane bits of these proposals as they > > fit nicely into the existing set of APIs. The filehandle idiocy on > > the other hand is way of into crackpipe land. > > ... > > a) networked filesystem specific. The mask stuff etc adds no > value whatsoever to actual "posix" filesystems. In fact it is > telling the kernel that it can violate posix semantics. I don't see what's network filesystem specific about it. Correct me if I'm wrong, but today ls -l on a local filesystem will first do readdir and then n stat calls. In the worst-case scenario this will generate n+1 disk seeks. Local filesystems go through a lot of trouble to try to make the disk layout of the directory entries and the inodes optimal, so that readahead and caching reduce the number of seeks. With readdirplus on the other hand, the filesystem would be able to send all the requests to the block layer and it would be free to optimize through disk elevators and whatnot. And this is not simply an "ls -l" optimization. Although I can no longer remember why, I think this is exactly what imap servers are doing when opening up big imap folders stored in maildir. -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
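For reference, the serial access pattern Ragnar is describing is just the following, in plain POSIX with nothing proposed: one readdir() pass followed by one synchronous lstat() per entry, which is where the roughly n+1 seeks (or server round trips) come from.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *dirpath = argc > 1 ? argv[1] : ".";
    char path[4096];
    struct stat st;
    struct dirent *de;
    DIR *d = opendir(dirpath);

    if (!d)
        return 1;
    while ((de = readdir(d)) != NULL) {                 /* one pass over the directory */
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        if (lstat(path, &st) == 0)                      /* one synchronous stat per entry */
            printf("%-40s %12lld\n", de->d_name, (long long)st.st_size);
    }
    closedir(d);
    return 0;
}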
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 12:22 ` Ragnar Kjørstad @ 2006-12-06 15:14 ` Trond Myklebust 0 siblings, 0 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-06 15:14 UTC (permalink / raw) To: Ragnar Kjørstad Cc: Christoph Hellwig, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, 2006-12-06 at 13:22 +0100, Ragnar Kjørstad wrote: > On Tue, Dec 05, 2006 at 09:55:16AM -0500, Trond Myklebust wrote: > > > The again statlite and > > > readdirplus really are the most sane bits of these proposals as they > > > fit nicely into the existing set of APIs. The filehandle idiocy on > > > the other hand is way of into crackpipe land. > > > > ... > > > > a) networked filesystem specific. The mask stuff etc adds no > > value whatsoever to actual "posix" filesystems. In fact it is > > telling the kernel that it can violate posix semantics. > > > I don't see what's network filesystem specific about it. Correct me if > I'm wrong, but today ls -l on a local filesystem will first do readdir > and then n stat calls. In the worst case scenario this will generate n+1 > disk seeks. I was referring to the caching mask. > Local filesystems go through a lot of trouble to try to make the disk > layout of the directory entries and the inodes optimal so that readahead > and caching reduces the number of seeks. > > With readdirplus on the other hand, the filesystem would be able to send > all the requests to the block layer and it would be free to optimize > through disk elevators and what not. As far as local filesystems are concerned, the procedure is still the same: read the directory contents, then do lookup() of the files returned to you by directory contents, then do getattr(). There is no way to magically queue up the getattr() calls before you have done the lookups, nor is there a way to magically queue up the lookups before you have read the directory contents. Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 10:07 ` Christoph Hellwig 2006-12-05 14:20 ` Matthew Wilcox 2006-12-05 14:55 ` Trond Myklebust @ 2006-12-05 16:55 ` Latchesar Ionkov 2006-12-05 22:12 ` Christoph Hellwig 2006-12-05 21:50 ` Rob Ross 3 siblings, 1 reply; 124+ messages in thread From: Latchesar Ionkov @ 2006-12-05 16:55 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On 12/5/06, Christoph Hellwig <hch@infradead.org> wrote: > The filehandle idiocy on the other hand is way of into crackpipe land. What is your opinion on giving the file system an option to look up more than one name/directory component at a time? I think that all remote file systems could benefit from that. Thanks, Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 16:55 ` Latchesar Ionkov @ 2006-12-05 22:12 ` Christoph Hellwig 2006-12-06 23:12 ` Latchesar Ionkov 0 siblings, 1 reply; 124+ messages in thread From: Christoph Hellwig @ 2006-12-05 22:12 UTC (permalink / raw) To: Latchesar Ionkov Cc: Christoph Hellwig, Trond Myklebust, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote: > What is your opinion on giving the file system an option to lookup a > file more than one name/directory at a time? I think that all remote > file systems can benefit from that? Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP routine gets the entire remaining path and is allowed to resolve as much of it as it can (or wants)? While this allows remote filesystems to optimize deep tree traversals, it creates a pretty big mess around the state that is kept across lookup operations. For Linux in particular it would mean doing large parts of __link_path_walk in the filesystem, which I can't think of a sane way to do. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 22:12 ` Christoph Hellwig @ 2006-12-06 23:12 ` Latchesar Ionkov 2006-12-06 23:33 ` Trond Myklebust 0 siblings, 1 reply; 124+ messages in thread From: Latchesar Ionkov @ 2006-12-06 23:12 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On 12/5/06, Christoph Hellwig <hch@infradead.org> wrote: > On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote: > > What is your opinion on giving the file system an option to lookup a > > file more than one name/directory at a time? I think that all remote > > file systems can benefit from that? > > Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP > routine get the entire remaining path and is allowed to resolve as much of > it as it can (or wants)? > > While this allows remote filesystems to optimize deep tree traversals it > creates a pretty big mess about state that is kept on lookup operations. > > For Linux in particular it would mean doing large parts of __link_path_walk > in the filesystem, which I can't thing of a sane way to do. The way I was thinking of implementing it is leaving all the hard parts of the name resolution in __link_path_walk and modifying inode's lookup operation to accept an array of qstrs (and its size). lookup would also check and revalidate the dentries if necessary (usually the same operation as looking up a name for the remote filesystems). lookup will check if it reaches symbolic link or mountpoint and will stop resolving any further. __link_path_walk will use the name to fill an array of qstrs (we can choose some sane size of the array, like 8 or 16), then call (directly or indirectly) ->lookup (nd->flags will reflect the flags for the last element in the array), check if the inode of the last dentry is symlink, and do what it currently does for symlinks. Does that make sense? Am I missing anything? Thanks, Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 23:12 ` Latchesar Ionkov @ 2006-12-06 23:33 ` Trond Myklebust 0 siblings, 0 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-06 23:33 UTC (permalink / raw) To: Latchesar Ionkov Cc: Christoph Hellwig, Rob Ross, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, 2006-12-06 at 18:12 -0500, Latchesar Ionkov wrote: > On 12/5/06, Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Dec 05, 2006 at 05:55:14PM +0100, Latchesar Ionkov wrote: > > > What is your opinion on giving the file system an option to lookup a > > > file more than one name/directory at a time? I think that all remote > > > file systems can benefit from that? > > > > Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP > > routine get the entire remaining path and is allowed to resolve as much of > > it as it can (or wants)? > > > > While this allows remote filesystems to optimize deep tree traversals it > > creates a pretty big mess about state that is kept on lookup operations. > > > > For Linux in particular it would mean doing large parts of __link_path_walk > > in the filesystem, which I can't thing of a sane way to do. > > The way I was thinking of implementing it is leaving all the hard > parts of the name resolution in __link_path_walk and modifying inode's > lookup operation to accept an array of qstrs (and its size). lookup > would also check and revalidate the dentries if necessary (usually the > same operation as looking up a name for the remote filesystems). I beg to differ. Revalidation is not the same as looking up: the locking rules are _very_ different. > lookup will check if it reaches symbolic link or mountpoint and will > stop resolving any further. __link_path_walk will use the name to fill > an array of qstrs (we can choose some sane size of the array, like 8 > or 16), then call (directly or indirectly) ->lookup (nd->flags will > reflect the flags for the last element in the array), check if the > inode of the last dentry is symlink, and do what it currently does for > symlinks. > > Does that make sense? Am I missing anything? Again: locking. How do you keep the dcache sane while the filesystem is doing a jumble of revalidation and new lookups. Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 10:07 ` Christoph Hellwig ` (2 preceding siblings ...) 2006-12-05 16:55 ` Latchesar Ionkov @ 2006-12-05 21:50 ` Rob Ross 2006-12-05 22:05 ` Christoph Hellwig 3 siblings, 1 reply; 124+ messages in thread From: Rob Ross @ 2006-12-05 21:50 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Christoph Hellwig wrote: > On Tue, Dec 05, 2006 at 12:56:40AM -0500, Trond Myklebust wrote: >> On Mon, 2006-12-04 at 18:59 -0600, Rob Ross wrote: >>> Hi all, >>> >>> I don't think that the group intended that there be an opendirplus(); >>> rather readdirplus() would simply be called instead of the usual >>> readdir(). We should clarify that. >>> >>> Regarding Peter Staubach's comments about no one ever using the >>> readdirplus() call; well, if people weren't performing this workload in >>> the first place, we wouldn't *need* this sort of call! This call is >>> specifically targeted at improving "ls -l" performance on large >>> directories, and Sage has pointed out quite nicely how that might work. >> ...and we have pointed out how nicely this ignores the realities of >> current caching models. There is no need for a readdirplus() system >> call. There may be a need for a caching barrier, but AFAICS that is all. > > I think Andreas mentioned that it is useful for clustered filesystems > that can avoid additional roundtrips this way. That alone might now > be enough reason for API additions, though. The again statlite and > readdirplus really are the most sane bits of these proposals as they > fit nicely into the existing set of APIs. The filehandle idiocy on > the other hand is way of into crackpipe land. So other than this "lite" version of the readdirplus() call, and this idea of making the flags indicate validity rather than accuracy, are there other comments on the directory-related calls? I understand that they might or might not ever make it in, but assuming they did, what other changes would you like to see? Thanks, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 21:50 ` Rob Ross @ 2006-12-05 22:05 ` Christoph Hellwig 2006-12-05 23:18 ` Sage Weil ` (2 more replies) 0 siblings, 3 replies; 124+ messages in thread From: Christoph Hellwig @ 2006-12-05 22:05 UTC (permalink / raw) To: Rob Ross Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel, drepper I'd like to Cc Ulrich Drepper in this thread because he's going to decide what APIs will be exposed at the C library level in the end, and he also has quite a lot of experience with the various standardization bodies. Ulrich, this is in reply to these API proposals: http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf On Tue, Dec 05, 2006 at 03:50:40PM -0600, Rob Ross wrote: > So other than this "lite" version of the readdirplus() call, and this > idea of making the flags indicate validity rather than accuracy, are > there other comments on the directory-related calls? I understand that > they might or might not ever make it in, but assuming they did, what > other changes would you like to see? statlite needs to separate the flag for valid fields from the actual stat structure and reuse the existing stat(64) structure. statlite needs to at least get a better name, or even better be folded into *statat*, either by having a new AT_VALID_MASK flag that enables a new unsigned int valid argument or by folding the valid flags into the AT_ flags. It should also drop the notion of required vs optional fields. If a filesystem always has certain values at hand it can just fill them in even if they weren't requested. I think having a statlite variant is pretty much consensus, we just need to fine-tune the actual API - and of course get a reference implementation. So if you want to get this going, try to implement it based on http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2. Bonus points for actually making use of the flags in some filesystems. Readdirplus is a little more involved. For one thing the actual kernel implementation will be a variant of the getdents() call anyway, while readdirplus would only be a library-level interface. At the actual C prototype level I would rename d_stat_err to d_stat_errno for consistency and maybe drop the readdirplus() entry point in favour of readdirplus_r only - there is no point in introducing new non-reentrant APIs today. Also we should not try to put in any of the synchronization or non-caching behaviours mentioned earlier in this thread (they're fortunately not in the pdf mentioned above either). If we ever want to implement these kinds of additional guarantees (which I doubt) that should happen using fadvise calls on the directory file descriptor. Note that I'm not as sure we really want this as I am about the partial stat operation. If in doubt, get the GFS folks to do the in-kernel infrastructure for their NFS serving needs first and then see where a syscall could help. ^ permalink raw reply [flat|nested] 124+ messages in thread
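A rough user-space mock of the interface shape Christoph argues for (the existing struct stat, a "wanted" mask in, a "filled" mask out) may help make the idea concrete. The SL_* bits and the statat_lite() name are invented for illustration, "somefile" is an arbitrary example name, and the body simply falls back to fstatat(); a real cluster file system would instead use the mask to skip gathering expensive size/time information.

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#define SL_SIZE   0x01u   /* st_size wanted/valid   (made-up constant) */
#define SL_MTIME  0x02u   /* st_mtime wanted/valid  (made-up constant) */
#define SL_BLOCKS 0x04u   /* st_blocks wanted/valid (made-up constant) */

static int statat_lite(int dirfd, const char *name, struct stat *st,
                       unsigned int wanted, unsigned int *filled)
{
    int ret = fstatat(dirfd, name, st, AT_SYMLINK_NOFOLLOW);

    /* A local filesystem has everything at hand, so it may report more
     * fields valid than were asked for; a cluster filesystem would only
     * set the bits it could satisfy without extra lock traffic. */
    *filled = (ret == 0) ? (wanted | SL_SIZE | SL_MTIME | SL_BLOCKS) : 0;
    return ret;
}

int main(void)
{
    struct stat st;
    unsigned int filled;

    /* "ls -i"-style caller: inode and type come cheap, size not needed. */
    if (statat_lite(AT_FDCWD, "somefile", &st, 0, &filled) == 0)
        printf("ino=%llu, size %s\n", (unsigned long long)st.st_ino,
               (filled & SL_SIZE) ? "valid" : "not requested");
    return 0;
}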
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 22:05 ` Christoph Hellwig @ 2006-12-05 23:18 ` Sage Weil 2006-12-05 23:55 ` Ulrich Drepper 2006-12-07 23:39 ` NFSv4/pNFS possible POSIX I/O API standards Nikita Danilov 2 siblings, 0 replies; 124+ messages in thread From: Sage Weil @ 2006-12-05 23:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Rob Ross, Trond Myklebust, Andreas Dilger, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel, drepper On Tue, 5 Dec 2006, Christoph Hellwig wrote: > Readdir plus is a little more involved. For one thing the actual kernel > implementation will be a variant of getdents() call anyway while a > readdirplus would only be a library level interface. At the actual > C prototype level I would rename d_stat_err to d_stat_errno for consistency > and maybe drop the readdirplus() entry point in favour of readdirplus_r > only - there is no point in introducing new non-reenetrant APIs today. > > Also we should not try to put in any of the synchronization or non-caching > behaviours mentioned earlier in this thread (they're fortunately not in > the pdf mentioned above either). If we ever want to implement these kinds > of additional gurantees (which I doubt) that should happen using fadvise > calls on the diretory file descriptor. Can you explain what the struct stat result portion of readdirplus() should mean in this case? My suggestion was that its consistency follow that of the directory entry (i.e. mimic the readdir() specification), which (as far as the POSIX description goes) means it is at least as recent as opendir(). That model seems to work pretty well for readdir() on both local and network filesystems, as it allows buffering and so forth. This is evident from the fact that it's semantics haven't been relaxed by NFS et al (AFAIK). Alternatively, one might specify that the result be valid at the time of the readdirplus() call, but I think everyone agrees that is unnecessary, and more importantly, semantically indistinguishable from a readdir()+stat(). The only other option I've heard seems to be that the validity of stat() not be specified at all. This strikes me as utterly pointless--why create a call whose result has no definition. It's also semantically indistinguishable from a readdir()+statlite(null mask). The fact that NFS and maybe others returned cached results for stat() doesn't seem relevant to how the call is _defined_. If the definition of stat() followed NFS, then it might read something like "returns information about a file that was accurate at some point in the last 30 seconds or so." On the other hand, if readdirplus()'s stat consistency is defined the same way as the dirent, NFS et al are still free to ignore that specification and return cached results, as they already do for stat(). (A 'lite' version of readdirplus() might even let users pick and choose, should the fs support both behaviors, just like statlite().) I don't really care what NFS does, but if readdirplus() is going to be specified at all, it should be defined in a way that makes sense and has some added semantic value. Also, one note about the fadvise() suggestion. I think there's a key distinction between what fadvise() currently does (provide hints to the filesystem for performance optimization) and your proposal, which would allow it to change the consistency semantics of other calls. 
That might be fine, but it strikes me as a slightly strange thing to specify new functionality that redefines previously defined semantics--even to realign with popular implementations. sage P.S. I should probably mention that I'm not part of the group working on this proposal. I've just been following their progress as it relates to my own distributed filesystem research. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 22:05 ` Christoph Hellwig 2006-12-05 23:18 ` Sage Weil @ 2006-12-05 23:55 ` Ulrich Drepper 2006-12-06 10:06 ` Andreas Dilger 2006-12-14 23:58 ` statlite() Rob Ross 2006-12-07 23:39 ` NFSv4/pNFS possible POSIX I/O API standards Nikita Danilov 2 siblings, 2 replies; 124+ messages in thread From: Ulrich Drepper @ 2006-12-05 23:55 UTC (permalink / raw) To: Christoph Hellwig Cc: Rob Ross, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Christoph Hellwig wrote: > Ulrich, this in reply to these API proposals: I know the documents. The HECWG was actually supposed to submit an actual draft to the OpenGroup-internal working group but I haven't seen anything yet. I'm not opposed to getting real-world experience first. >> So other than this "lite" version of the readdirplus() call, and this >> idea of making the flags indicate validity rather than accuracy, are >> there other comments on the directory-related calls? I understand that >> they might or might not ever make it in, but assuming they did, what >> other changes would you like to see? I don't think an accuracy flag is useful at all. Programs don't want to use fuzzy information. If you want a fast 'ls -l' then add a mode which doesn't print the fields which are not provided. Don't provide outdated information. Similarly for other programs. > statlite needs to separate the flag for valid fields from the actual > stat structure and reuse the existing stat(64) structure. stat lite > needs to at least get a better name, even better be folded into *statat*, > either by having a new AT_VALID_MASK flag that enables a new > unsigned int valid argument or by folding the valid flags into the AT_ > flags. Yes, this is also my pet peeve with this interface. I don't want to have another data structure. Especially since programs might want to store the value in places where normal stat results are returned. And also yes on 'statat'. I strongly suggest to define only a statat variant. In the standards group I'll vehemently oppose the introduction of yet another superfluous non-*at interface. As for reusing the existing statat interface and magically add another parameter through ellipsis: no. We need to become more type-safe. The userlevel interface needs to be a new one. For the system call there is no such restriction. We can indeed extend the existing syscall. We have appropriate checks for the validity of the flags parameter in place which make such calls backward compatible. > I think having a stat lite variant is pretty much consensus, we just need > to fine tune the actual API - and of course get a reference implementation. > So if you want to get this going try to implement it based on > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2. > Bonus points for actually making use of the flags in some filesystems. I don't like that approach. The flag parameter should be exclusively an output parameter. By default the kernel should fill in all the fields it has access to. If access is not easily possible then set the bit and clear the field. There are of course certain fields which always should be added. In the proposed man page these are already identified (i.e., those before the st_litemask member). 
> At the actual > C prototype level I would rename d_stat_err to d_stat_errno for consistency > and maybe drop the readdirplus() entry point in favour of readdirplus_r > only - there is no point in introducing new non-reenetrant APIs today. No, readdirplus should be kept (and yes, readdirplus_r must be added). The reason is that the readdir_r interface is only needed if multiple threads use the _same_ DIR stream. This is hardly ever the case. Forcing everybody to use the _r variant means that we unconditionally have to copy the data in the user-provided buffer. With readdir there is the possibility to just pass back a pointer into the internal buffer read into by getdents. This is how readdir works for most kernel/arch combinations. This requires that the dirent_plus structure matches so it's important to get it right. I'm not comfortable with the current proposal. Yes, having ordinary dirent and stat structure in there is a plus. But we have overlap: - d_ino and st_ino - d_type and parts of st_mode And we have superfluous information - st_dev, the same for all entries, at least this is what readdir assumes I haven't made up my mind yet whether this is enough reason to introduce a new type which isn't made up of the the two structures. And one last point: I haven't seen any discussion why readdirplus should do the equivalent of stat and there is no 'statlite' variant. Are all places for readdir is used non-critical for performance or depend on accurate information? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
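Spelled out, the record and entry points under debate look roughly like the sketch below, which makes the overlap Ulrich points at visible. It follows the readdirplus draft only as it is characterized in this thread (a plain dirent, a plain stat, plus a per-entry error, using the d_stat_errno spelling suggested above); the exact names and layout are illustrative, not a published interface.

#include <dirent.h>
#include <sys/stat.h>

struct dirent_plus {
    struct dirent d_dirent;     /* d_ino and d_type ...                        */
    struct stat   d_stat;       /* ... duplicate st_ino and the S_IFMT bits of
                                 * st_mode; st_dev repeats for every entry      */
    int           d_stat_errno; /* 0, or why d_stat could not be filled in      */
};

/* By analogy with readdir()/readdir_r(): the plain form could hand back a
 * pointer into a buffer owned by the DIR stream, while the _r form copies
 * into caller-provided storage. */
struct dirent_plus *readdirplus(DIR *dirp);
int readdirplus_r(DIR *dirp, struct dirent_plus *entry,
                  struct dirent_plus **result);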
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 23:55 ` Ulrich Drepper @ 2006-12-06 10:06 ` Andreas Dilger 2006-12-06 17:19 ` Ulrich Drepper 2006-12-14 23:58 ` statlite() Rob Ross 1 sibling, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-06 10:06 UTC (permalink / raw) To: Ulrich Drepper Cc: Christoph Hellwig, Rob Ross, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 05, 2006 15:55 -0800, Ulrich Drepper wrote: > I don't think an accuracy flag is useful at all. Programs don't want to > use fuzzy information. If you want a fast 'ls -l' then add a mode which > doesn't print the fields which are not provided. Don't provide outdated > information. Similarly for other programs. Does this mean you are against the statlite() API entirely, or only against the document's use of the flag as a vague "accuracy" value instead of a hard "valid" value? > Christoph Hellwig wrote: > >I think having a stat lite variant is pretty much consensus, we just need > >to fine tune the actual API - and of course get a reference implementation. > >So if you want to get this going try to implement it based on > >http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2. > >Bonus points for actually making use of the flags in some filesystems. > > I don't like that approach. The flag parameter should be exclusively an > output parameter. By default the kernel should fill in all the fields > it has access to. If access is not easily possible then set the bit and > clear the field. There are of course certain fields which always should > be added. In the proposed man page these are already identified (i.e., > those before the st_litemask member). IMHO, if the application doesn't need a particular field (e.g. "ls -i" doesn't need size, "ls -s" doesn't need the inode number, etc) why should these be filled in if they are not easily accessible? As for what is easily accessible, that needs to be determined by the filesystem itself. It is of course fine if the filesystem fills in values that it has at hand, even if they are not requested, but it shouldn't have to do extra work to fill in values that will not be needed. "ls --color" and "ls -F" are prime examples. It does stat on files only to get the file mode (the file type is already part of many dirent structs). But a clustered filesystem may need to do a lot of work to also get the file size, because it has no way of knowing which stat() fields are needed. > And one last point: I haven't seen any discussion why readdirplus should > do the equivalent of stat and there is no 'statlite' variant. Are all > places for readdir is used non-critical for performance or depend on > accurate information? That was previously suggested by me already. IMHO, there should ONLY be a statlite variant of readdirplus(), and I think most people agree with that part of it (though there is contention on whether readdirplus() is needed at all). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 10:06 ` Andreas Dilger @ 2006-12-06 17:19 ` Ulrich Drepper 2006-12-06 17:27 ` Rob Ross 0 siblings, 1 reply; 124+ messages in thread From: Ulrich Drepper @ 2006-12-06 17:19 UTC (permalink / raw) To: Ulrich Drepper, Christoph Hellwig, Rob Ross, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Andreas Dilger wrote: > Does this mean you are against the statlite() API entirely, or only against > the document's use of the flag as a vague "accuracy" value instead of a > hard "valid" value? I'm against fuzzy values. I've no problems with a bitmap specifying that certain members are not wanted or wanted (probably the later, zero meaning the optional fields are not wanted). > IMHO, if the application doesn't need a particular field (e.g. "ls -i" > doesn't need size, "ls -s" doesn't need the inode number, etc) why should > these be filled in if they are not easily accessible? As for what is > easily accessible, that needs to be determined by the filesystem itself. Is the size not easily accessible? It would surprise me. If yes, then, by all means add it to the list. I'm not against extending the list of members which are optional if it makes sense. But certain information is certainly always easily accessible. > That was previously suggested by me already. IMHO, there should ONLY be > a statlite variant of readdirplus(), and I think most people agree with > that part of it (though there is contention on whether readdirplus() is > needed at all). Indeed. Given there is statlite and we have d_type information, in most situations we won't need more complete stat information. Outside of programs like ls that is. Part of why I wished the lab guys had submitted the draft to the OpenGroup first is that this way they would have to be more detailed on why each and every interface they propose for adding is really needed. Maybe they can do it now and here. What programs really require readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 17:19 ` Ulrich Drepper @ 2006-12-06 17:27 ` Rob Ross 2006-12-06 17:42 ` Ulrich Drepper 0 siblings, 1 reply; 124+ messages in thread From: Rob Ross @ 2006-12-06 17:27 UTC (permalink / raw) To: Ulrich Drepper Cc: Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Ulrich Drepper wrote: > Andreas Dilger wrote: >> Does this mean you are against the statlite() API entirely, or only >> against >> the document's use of the flag as a vague "accuracy" value instead of a >> hard "valid" value? > > I'm against fuzzy values. I've no problems with a bitmap specifying > that certain members are not wanted or wanted (probably the later, zero > meaning the optional fields are not wanted). Thanks for clarifying. >> IMHO, if the application doesn't need a particular field (e.g. "ls -i" >> doesn't need size, "ls -s" doesn't need the inode number, etc) why should >> these be filled in if they are not easily accessible? As for what is >> easily accessible, that needs to be determined by the filesystem itself. > > Is the size not easily accessible? It would surprise me. If yes, then, > by all means add it to the list. I'm not against extending the list of > members which are optional if it makes sense. But certain information > is certainly always easily accessible. File size is definitely one of the more difficult of the parameters, either because (a) it isn't stored in one place but is instead derived, or (b) because a lock has to be obtained to guarantee consistency of the returned value. >> That was previously suggested by me already. IMHO, there should ONLY be >> a statlite variant of readdirplus(), and I think most people agree with >> that part of it (though there is contention on whether readdirplus() is >> needed at all). > > Indeed. Given there is statlite and we have d_type information, in most > situations we won't need more complete stat information. Outside of > programs like ls that is. > > Part of why I wished the lab guys had submitted the draft to the > OpenGroup first is that this way they would have to be more detailed on > why each and every interface they propose for adding is really needed. > Maybe they can do it now and here. What programs really require > readdirplus? I can't speak for everyone, but "ls" is the #1 consumer as far as I am concerned. Regards, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 17:27 ` Rob Ross @ 2006-12-06 17:42 ` Ulrich Drepper 2006-12-06 18:01 ` Ragnar Kjørstad 2006-12-07 5:57 ` Andreas Dilger 0 siblings, 2 replies; 124+ messages in thread From: Ulrich Drepper @ 2006-12-06 17:42 UTC (permalink / raw) To: Rob Ross Cc: Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Rob Ross wrote: > File size is definitely one of the more difficult of the parameters, > either because (a) it isn't stored in one place but is instead derived, > or (b) because a lock has to be obtained to guarantee consistency of the > returned value. OK, and looking at the man page again, it is already on the list in the old proposal and hence optional. I've no problem with that. > I can't speak for everyone, but "ls" is the #1 consumer as far as I am > concerned. So a syscall for ls alone? I think this is more a user problem. For normal plain old 'ls' you get by with readdir. For 'ls -F' and 'ls --color' you mostly get by with readdir+d_type. If you cannot provide d_type info the readdirplus extension does you no good. For the cases when an additional stat is needed (for symlinks, for instance, to test whether they are dangling) readdirplus won't help. So, readdirplus is really only useful for 'ls -l'. But then you need st_size and st_?time. So what is gained with readdirplus? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
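For what it's worth, the "readdir + d_type" fast path Ulrich refers to is plain existing glibc usage, along the lines of the following sketch: classify entries from the dirent alone and pay for an lstat() only when the filesystem reports DT_UNKNOWN.

#define _DEFAULT_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *dirpath = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(dirpath);
    struct dirent *de;

    if (!d)
        return 1;
    while ((de = readdir(d)) != NULL) {
        char suffix = ' ';
        if (de->d_type == DT_DIR) {
            suffix = '/';
        } else if (de->d_type == DT_LNK) {
            suffix = '@';
        } else if (de->d_type == DT_UNKNOWN) {
            /* Filesystem gave no type: only now fall back to a stat. */
            char path[4096];
            struct stat st;
            snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
            if (lstat(path, &st) == 0 && S_ISDIR(st.st_mode))
                suffix = '/';
        }
        printf("%s%c\n", de->d_name, suffix);
    }
    closedir(d);
    return 0;
}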
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 17:42 ` Ulrich Drepper @ 2006-12-06 18:01 ` Ragnar Kjørstad 2006-12-06 18:13 ` Ulrich Drepper 2006-12-07 5:57 ` Andreas Dilger 1 sibling, 1 reply; 124+ messages in thread From: Ragnar Kjørstad @ 2006-12-06 18:01 UTC (permalink / raw) To: Ulrich Drepper Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 09:42:55AM -0800, Ulrich Drepper wrote: > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. > > So a syscall for ls alone? I guess the code needs to be checked, but I would think that: * ls * find * rm -r * chown -R * chmod -R * rsync * various backup software * imap servers are all likely users of readdirplus. Of course the ones that spend the majority of the time doing stat are the ones that would benefit more. -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 18:01 ` Ragnar Kjørstad @ 2006-12-06 18:13 ` Ulrich Drepper 2006-12-17 14:41 ` Ragnar Kjørstad 0 siblings, 1 reply; 124+ messages in thread From: Ulrich Drepper @ 2006-12-06 18:13 UTC (permalink / raw) To: Ragnar Kjørstad Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Ragnar Kjørstad wrote: > I guess the code needs to be checked, but I would think that: > * ls > * find > * rm -r > * chown -R > * chmod -R > * rsync > * various backup software > * imap servers Then somebody do the analysis. And please an analysis which takes into account that some programs might need to be adapted to take advantage of d_type or non-optional data from the proposed statlite. Plus, how often are these commands really used on such filesystems? I'd hope that chown -R or so is a once in a lifetime thing on such filesystems and not worth optimizing for. I'd suggest until such data is provided the readdirplus plans are put on hold. statlite I have no problems with if the semantics is changed as I explained. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 18:13 ` Ulrich Drepper @ 2006-12-17 14:41 ` Ragnar Kjørstad 2006-12-17 19:07 ` Ulrich Drepper 0 siblings, 1 reply; 124+ messages in thread From: Ragnar Kjørstad @ 2006-12-17 14:41 UTC (permalink / raw) To: Ulrich Drepper Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 10:13:36AM -0800, Ulrich Drepper wrote: > Ragnar Kjørstad wrote: > >I guess the code needs to be checked, but I would think that: > >* ls > >* find > >* rm -r > >* chown -R > >* chmod -R > >* rsync > >* various backup software > >* imap servers > > Then somebody do the analysis. This is by no means a full analysis, but maybe someone will find it useful anyway. All performance tests are done with a directory tree with the lkml archive in maildir format on a local ext3 filesystem. The numbers are systemcall walltime, seen through strace. I think Andreas already wrote that "ls --color" is the default in most distributions and needs to stat every file. ls --color -R kernel_old: 82.27% 176.37s 0.325ms 543332 lstat 17.61% 37.75s 5.860ms 6442 getdents64 0.04% 0.09s 0.018ms 4997 write 0.03% 0.06s 55.462ms 1 execve 0.02% 0.04s 5.255ms 8 poll "find" is already smart enough to not call stat when it's not needed, and make use of d_type when it's available. But in many cases stat is still needed (such as with -user) find kernel_old -not -user 1002: 83.63% 173.11s 0.319ms 543338 lstat 16.31% 33.77s 5.242ms 6442 getdents64 0.03% 0.06s 62.882ms 1 execve 0.01% 0.03s 6.904ms 4 poll 0.01% 0.02s 8.383ms 2 connect rm was a false alarm. It only uses stat to check for directories, and it's already beeing smart about it, not statting directories with n_links==2. chown uses stat to: * check for directories / symlinks / regular files * Only change ownership on files with a specific existing ownership. * Only change ownership if the requested owner does not match the current owner. * Different output when ownership is actually changed from when it's not necessary (in verbose mode). * Reset S_UID, S_GID options after setting ownership in some cases. but it seems the most recent version will not use stat for every file with typical options: chown -R rk kernel_old: 93.30% 463.84s 0.854ms 543337 lchown 6.67% 33.18s 5.151ms 6442 getdents64 0.01% 0.04s 0.036ms 1224 brk 0.00% 0.02s 5.830ms 4 poll 0.00% 0.02s 0.526ms 38 open chmod needs stat to do things like "u+w", but the current implementation uses stat regardless of if it's needed or not. chmod -R o+w kernel_old: 62.50% 358.84s 0.660ms 543337 chmod 30.66% 176.05s 0.324ms 543336 lstat 6.82% 39.17s 6.081ms 6442 getdents64 0.01% 0.05s 54.515ms 1 execve 0.01% 0.05s 0.037ms 1224 brk chmod -R 0755 kernel_old: 61.21% 354.42s 0.652ms 543337 chmod 30.33% 175.61s 0.323ms 543336 lstat 8.46% 48.96s 7.600ms 6442 getdents64 0.01% 0.05s 0.037ms 1224 brk 0.00% 0.01s 13.417ms 1 execve Seems I was wrong about the imap servers. They (at least dovecot) do not use a significant amount of time doing stat when opening folders: 84.90% 24.75s 13.137ms 1884 writev 11.23% 3.27s 204.675ms 16 poll 0.95% 0.28s 0.023ms 11932 open 0.89% 0.26s 0.022ms 12003 pread 0.76% 0.22s 12.239ms 18 getdents64 0.63% 0.18s 0.015ms 11942 close 0.63% 0.18s 0.015ms 11936 fstat I don't think any code inspection is needed to determine that rsync requires stat of every file, regardless of d_type. 
Initial rsync: rsync -a kernel_old copy 78.23% 2914.59s 5.305ms 549452 read 6.69% 249.17s 0.046ms 5462876 write 4.82% 179.44s 0.330ms 543338 lstat 4.57% 170.33s 0.313ms 543355 open 4.13% 153.79s 0.028ms 5468732 select rsync on identical directories: rsync -a kernel_old copy 61.81% 189.27s 0.348ms 543338 lstat 25.23% 77.25s 15.917ms 4853 select 12.72% 38.94s 6.045ms 6442 getdents64 0.19% 0.57s 0.118ms 4840 write 0.03% 0.08s 3.736ms 22 open tar cjgf incremental kernel_backup.tar kernel_old/ 67.69% 2463.49s 3.030ms 812948 read 22.94% 834.85s 2.565ms 325471 write 7.51% 273.45s 0.252ms 1086675 lstat 0.94% 34.25s 2.658ms 12884 getdents64 0.35% 12.63s 0.023ms 543370 open incremental: 81.71% 171.62s 0.316ms 543342 lstat 16.81% 35.32s 2.741ms 12884 getdents64 1.40% 2.94s 1.930ms 1523 write 0.04% 0.09s 86.668ms 1 wait4 0.02% 0.03s 34.300ms 1 execve > And please an analysis which takes into > account that some programs might need to be adapted to take advantage of > d_type or non-optional data from the proposed statlite. d_type may be useful in some cases, but I think mostly as a replacement for the nlink==2 hacks for directory recursion. There are clearly many stat-heavy examples that can not be optimized with d_type. > Plus, how often are these commands really used on such filesystems? I'd > hope that chown -R or so is a once in a lifetime thing on such > filesystems and not worth optimizing for. I think you're right about chown/chmod beeing rare and should not be the main focus. The other examples on my list is probably better. And they are just examples - there are probably many many others as well. And what do you mean by "such filesystems"? I know this came up in the context of clustered filesystems, but unless I'm missing something fundamentally here readdirplus could be just as useful on local filesystems as clustered filesystems if it allowed parallel execution of the getattrs. -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-17 14:41 ` Ragnar Kjørstad @ 2006-12-17 19:07 ` Ulrich Drepper 2006-12-17 19:38 ` Matthew Wilcox 0 siblings, 1 reply; 124+ messages in thread From: Ulrich Drepper @ 2006-12-17 19:07 UTC (permalink / raw) To: Ragnar Kjørstad Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Ragnar Kjørstad wrote: > I think Andreas already wrote that "ls --color" is the default in most > distributions and needs to stat every file. Remove the :ex entry from LS_COLORS and try again. > "find" is already smart enough to not call stat when it's not needed, > and make use of d_type when it's available. But in many cases stat is > still needed (such as with -user) > > find kernel_old -not -user 1002: > 83.63% 173.11s 0.319ms 543338 lstat > 16.31% 33.77s 5.242ms 6442 getdents64 > 0.03% 0.06s 62.882ms 1 execve > 0.01% 0.03s 6.904ms 4 poll > 0.01% 0.02s 8.383ms 2 connect And how often do the scripts which are in everyday use require such a command? And the same for the other programs. I do not doubt that such a new syscall can potentially be useful. The question is whether it is worth it given _real_ situations on today's systems. And more so: on systems where combining the operations really makes a difference. Exposing new data structures is no small feat. It's always risky since something might require a change and then backward compatibility is an issue. Introducing new syscalls just because a combination of two existing ones happens to be used in some programs is not scalable and not the Unix-way. Small building blocks. Otherwise I'd have more proposals which can be much more widely usable (e.g., syscall to read a file into a freshly mmaped area). Nobody wants to go that route since it would lead to creeping featurism. So it is up to the proponents of readdirplus to show this is not such a situation. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-17 19:07 ` Ulrich Drepper @ 2006-12-17 19:38 ` Matthew Wilcox 2006-12-17 21:51 ` Ulrich Drepper 0 siblings, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-12-17 19:38 UTC (permalink / raw) To: Ulrich Drepper Cc: Ragnar Kj??rstad, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Sun, Dec 17, 2006 at 11:07:27AM -0800, Ulrich Drepper wrote: > And how often do the scripts which are in everyday use require such a > command? And the same for the other programs. I know that the rsync load is a major factor on kernel.org right now. With all the git trees (particularly the ones that people haven't packed recently), there's a lot of files in a lot of directories. If readdirplus would help this situation, it would definitely have a real world benefit. Obviously, I haven't done any measurements or attempted to quantify what the improvement would be. For those not familiar with a git repo, it has an 'objects' directory with 256 directories named 00 to ff. Each of those directories can contain many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once a file is written, it is never modified, so all rsync needs to do is be able to compare the timestamps and sizes and notice they haven't changed. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-17 19:38 ` Matthew Wilcox @ 2006-12-17 21:51 ` Ulrich Drepper 2006-12-18 2:57 ` Ragnar Kjørstad 0 siblings, 1 reply; 124+ messages in thread From: Ulrich Drepper @ 2006-12-17 21:51 UTC (permalink / raw) To: Matthew Wilcox Cc: Ragnar Kjørstad, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Matthew Wilcox wrote: > I know that the rsync load is a major factor on kernel.org right now. That should be quite easy to quantify then. Move the readdir and stat call next to each other in the sources, pass the struct stat around if necessary, and then count the stat calls which do not originate from the stat following the readdir call. Of course we'll also need the actual improvement which can be achieved by combining the calls. Given the inodes are cached, is there more overhead than finding the right inode? Note that if rsync doesn't already use fstatat() it should do so, and this then means that there is no long file path to follow; all file names are local to the directory opened with opendir(). My gut feeling is that the improvements are minimal for normal (not cluster etc.) filesystems and hence the improvements for kernel.org would be minimal. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-17 21:51 ` Ulrich Drepper @ 2006-12-18 2:57 ` Ragnar Kjørstad 2006-12-18 3:54 ` Gary Grider 0 siblings, 1 reply; 124+ messages in thread From: Ragnar Kjørstad @ 2006-12-18 2:57 UTC (permalink / raw) To: Ulrich Drepper Cc: Matthew Wilcox, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Sun, Dec 17, 2006 at 01:51:38PM -0800, Ulrich Drepper wrote: > Matthew Wilcox wrote: > >I know that the rsync load is a major factor on kernel.org right now. > > That should be quite easy to quantify then. Move the readdir and stat > call next to each other in the sources, pass the struct stat around if > necessary, and then count the stat calls which do not originate from the > stat following the readdir call. Of course we'll also need the actual > improvement which can be achieved by combining the calls. Given the > inodes are cached, is there more overhead then finding the right inode? > Note that is rsync doesn't already use fstatat() it should do so and > this means then that there is no long file path to follow, all file > names are local to the directory opened with opendir(). > > My but feeling is that the improvements are minimal for normal (not > cluster etc) filesystems and hence the improvements for kernel.org would > be minimal. I don't think the overhead of finding the right inode or the system calls themselves makes a difference at all. E.g. the rsync numbers I listed spend more than 0.3ms per stat syscall. That kind of time is not spent in looking up kernel data structures - it's spent doing IO. The part that I think is important (and please correct me if I've gotten it all wrong) is to do the IO in parallel. This applies both to local filesystems and clustered filesystems, although it would probably be much more significant for clustered filesystems since they would typically have longer latency for each roundtrip. Today there is no good way for an application to stat many files in parallel. You could do it through threading, but with significant overhead and complexity. I'm curious what results one would get by comparing performance of: * application doing readdir and then stat on every single file * application doing readdirplus * application doing readdir and then stat on every file using a lot of threads or an asynchronous stat interface As far as parallel IO goes, I would think that async stat would be nearly as fast as readdirplus? For the clustered filesystem case there may be locking issues that make readdirplus faster? -- Ragnar Kjørstad Software Engineer Scali - http://www.scali.com Scaling the Linux Datacenter - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
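To make the threaded variant above concrete, a rough sketch (not a benchmark): read the names first, then fan the lstat() calls out over a few worker threads so the per-file round trips can overlap. The thread count and the fixed-size name table are arbitrary choices for illustration.

#define _GNU_SOURCE
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_ENTRIES 65536
#define NTHREADS    8

static char *names[MAX_ENTRIES];
static int nnames;

static void *stat_worker(void *arg)
{
    long id = (long)arg;
    struct stat st;

    /* each worker takes every NTHREADS-th name, so the calls overlap */
    for (int i = (int)id; i < nnames; i += NTHREADS)
        lstat(names[i], &st);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc > 1 && chdir(argv[1]) != 0)
        return 1;

    DIR *dir = opendir(".");
    struct dirent *de;
    while (dir && (de = readdir(dir)) != NULL && nnames < MAX_ENTRIES)
        names[nnames++] = strdup(de->d_name);
    if (dir)
        closedir(dir);

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, stat_worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("statted %d entries\n", nnames);
    return 0;
}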
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-18 2:57 ` Ragnar Kjørstad @ 2006-12-18 3:54 ` Gary Grider 0 siblings, 0 replies; 124+ messages in thread From: Gary Grider @ 2006-12-18 3:54 UTC (permalink / raw) To: Ragnar Kjørstad, Ulrich Drepper Cc: Matthew Wilcox, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, linux-fsdevel At 07:57 PM 12/17/2006, Ragnar Kjørstad wrote: >On Sun, Dec 17, 2006 at 01:51:38PM -0800, Ulrich Drepper wrote: > > Matthew Wilcox wrote: > > >I know that the rsync load is a major factor on kernel.org right now. > > > > That should be quite easy to quantify then. Move the readdir and stat > > call next to each other in the sources, pass the struct stat around if > > necessary, and then count the stat calls which do not originate from the > > stat following the readdir call. Of course we'll also need the actual > > improvement which can be achieved by combining the calls. Given the > > inodes are cached, is there more overhead then finding the right inode? > > Note that is rsync doesn't already use fstatat() it should do so and > > this means then that there is no long file path to follow, all file > > names are local to the directory opened with opendir(). > > > > My but feeling is that the improvements are minimal for normal (not > > cluster etc) filesystems and hence the improvements for kernel.org would > > be minimal. > >I don't think the overhead of finding the right inode or the system >calls themselves makes a difference at all. E.g. the rsync numbers I >listed spend more than 0.3ms per stat syscall. That kind of time is not >spent in looking up kernel datastructures - it's spent doing IO. > >That part that I think is important (and please correct me if I've >gotten it all wrong) is to do the IO in parallel. This applies both to >local filesystems and clustered filesystems, allthough it would probably >be much more significant for clustered filesystems since they would >typically have longer latency for each roundtrip. Today there is no good >way for an application to stat many files in parallel. You could do it >through threading, but with significant overhead and complexity. > >I'm curious what results one would get by comparing performance of: >* application doing readdir and then stat on every single file >* application doing readdirplus >* application doing readdir and then stat on every file using a lot of > threads or an asyncronous stat interface We have done something similar to what you suggest. We wrote a parallel file tree walker to run on clustered file systems that spread the file systems metadata out over multiple disks. The program parallelizes the stat operations across multiple nodes (via MPI). We needed to walk a tree with about a hundred million files in a reasonable amount of time. We cut the time from dozens of hours to less than an hour. We were able to keep all the metadata raids/disks much busier doing the work for the stat operations. We have used this on two different clustered file systems with similar results. In both cases, it scaled with the number of disks the metadata was spread over, not quite linearly but it was a huge win for these two file systems. Gary >As far as parallel IO goes, I would think that async stat would be >nearly as fast as readdirplus? >For the clustered filesystem case there may be locking issues that makes >readdirplus faster? 
> > >-- >Ragnar Kjørstad >Software Engineer >Scali - http://www.scali.com >Scaling the Linux Datacenter - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-06 17:42 ` Ulrich Drepper 2006-12-06 18:01 ` Ragnar Kjørstad @ 2006-12-07 5:57 ` Andreas Dilger 2006-12-15 22:37 ` Ulrich Drepper 1 sibling, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-07 5:57 UTC (permalink / raw) To: Ulrich Drepper Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 06, 2006 09:42 -0800, Ulrich Drepper wrote: > Rob Ross wrote: > >File size is definitely one of the more difficult of the parameters, > >either because (a) it isn't stored in one place but is instead derived, > >or (b) because a lock has to be obtained to guarantee consistency of the > >returned value. > > OK, and looking at the man page again, it is already on the list in the > old proposal and hence optional. I've no problem with that. IMHO, once part of the information is optional, why bother making ANY of it required? Consider "ls -s" on a distributed filesystem that has UID+GID mapping. It doesn't actually NEED to return the UID+GID to ls for each file, since it won't be shown, but if that is part of the "required" fields then the filesystem would have to remap each UID+GID on each file in the directory. Similar arguments can be made for "find" with various options (-atime, -mtime, etc) where any one of the "required" parameters isn't needed. I don't think it is _harmful_ to fill in unrequested values if they are readily available (it might in fact avoid a lot of conditional branches) but why not let the caller request only the minimum information it needs? > >I can't speak for everyone, but "ls" is the #1 consumer as far as I am > >concerned. That is my opinion also. Lustre can do incredibly fast IO, but it isn't very good at "ls" at all because it has to do way more work than you would think (and I've stared at a lot of straces from ls, rm, etc). > I think this is more a user problem. For normal plain old 'ls' you get > by with readdir. For 'ls -F' and 'ls --color' you mostly get by with > readdir+d_type. If you cannot provide d_type info the readdirplus > extension does you no good. For the cases when an additional stat is > needed (for symlinks, for instance, to test whether they are dangling) > readdirplus won't help. I used to think this also, but even though Lustre supplies d_type info GNU ls will still do stat operations because "ls --color" depends on st_mode in order to color executable files differently. Since virtually all distros alias ls to "ls --color" this is pretty much default behaviour. Another popular alias is "ls -F" which also uses st_mode for executables. $ strace ls --color=yes # this is on an ext3 filesystem : : open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3 fstat64(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0 getdents64(3, /* 53 entries */, 4096) = 1840 lstat64("ChangeLog", {st_mode=S_IFREG|0660, st_size=48, ...}) = 0 lstat64("install-sh", {st_mode=S_IFREG|0755, st_size=7122, ...}) = 0 lstat64("config.sub", {st_mode=S_IFREG|0755, st_size=30221, ...}) = 0 lstat64("autogen.sh", {st_mode=S_IFREG|0660, st_size=41, ...}) = 0 lstat64("config.h", {st_mode=S_IFREG|0664, st_size=7177, ...}) = 0 lstat64("COPYING", {st_mode=S_IFREG|0660, st_size=18483, ...}) = 0 : : Similarly, GNU rm will stat all of the files (when run as a regular user) to ask the "rm: remove write-protected regular file `foo.orig'?" question, which also depends on st_mode. 
Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
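To spell out the d_type point above: directories and symlinks can be classified from the dirent alone, but the executable bit only exists in st_mode, so colouring executables costs one stat per regular file. A minimal sketch (real ls is driven by the LS_COLORS table, which this ignores):

#define _GNU_SOURCE             /* d_type / DT_* */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    DIR *dir = opendir(".");
    struct dirent *de;

    while (dir && (de = readdir(dir)) != NULL) {
        if (de->d_type == DT_DIR) {
            printf("dir      %s\n", de->d_name);    /* no stat needed */
        } else if (de->d_type == DT_LNK) {
            printf("symlink  %s\n", de->d_name);    /* no stat needed */
        } else {
            struct stat st;                         /* the extra call at issue */
            if (lstat(de->d_name, &st) == 0 && (st.st_mode & S_IXUSR))
                printf("exec     %s\n", de->d_name);
            else
                printf("file     %s\n", de->d_name);
        }
    }
    if (dir)
        closedir(dir);
    return 0;
}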
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-07 5:57 ` Andreas Dilger @ 2006-12-15 22:37 ` Ulrich Drepper 2006-12-16 18:13 ` Andreas Dilger 0 siblings, 1 reply; 124+ messages in thread From: Ulrich Drepper @ 2006-12-15 22:37 UTC (permalink / raw) To: Ulrich Drepper, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Andreas Dilger wrote: > IMHO, once part of the information is optional, why bother making ANY > of it required? Consider "ls -s" on a distributed filesystem that has > UID+GID mapping. It doesn't actually NEED to return the UID+GID to ls > for each file, since it won't be shown, but if that is part of the > "required" fields then the filesystem would have to remap each UID+GID > on each file in the directory. Similar arguments can be made for "find" > with various options (-atime, -mtime, etc) where any one of the "required" > parameters isn't needed. The kernel at least has to clear the fields in the stat structure in any case. So, if the information is easily available, why add another 'if' when the real information can be filled in just as easily? I don't know the kernel code but I would sincerely look at this case. The extra 'if's might be more expensive than just doing it. > I used to think this also, but even though Lustre supplies d_type info > GNU ls will still do stat operations because "ls --color" depends on > st_mode in order to color executable files differently. Right, and only executables. You can easily leave out the :ex=*** part of LS_COLORS. I don't think it's useful to introduce a new system call just to have this support. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-15 22:37 ` Ulrich Drepper @ 2006-12-16 18:13 ` Andreas Dilger 2006-12-16 19:08 ` Ulrich Drepper 0 siblings, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-16 18:13 UTC (permalink / raw) To: Ulrich Drepper Cc: Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 15, 2006 14:37 -0800, Ulrich Drepper wrote: > Andreas Dilger wrote: > >IMHO, once part of the information is optional, why bother making ANY > >of it required? Consider "ls -s" on a distributed filesystem that has > >UID+GID mapping. It doesn't actually NEED to return the UID+GID to ls > >for each file, since it won't be shown, but if that is part of the > >"required" fields then the filesystem would have to remap each UID+GID > >on each file in the directory. Similar arguments can be made for "find" > >with various options (-atime, -mtime, etc) where any one of the "required" > >parameters isn't needed. > > The kernel at least has to clear the fields in the stat structure in any > case. So, if information is easily available, why add another 'if' in > the case if the real information can be filled in just as easily? The kernel doesn't necessarily have to clear the fields. The per-field valid flag would determine if that field had valid data or garbage. That said, there is no harm in the kernel/fs filling in additional fields if they are readily available. It would be up to the caller to NOT ask for fields that it doesn't need, as that _might_ cause additional work on the part of the filesystem. If the kernel returns extra valid fields (because they are "free") that the application doesn't use, no harm done. > >I used to think this also, but even though Lustre supplies d_type info > >GNU ls will still do stat operations because "ls --color" depends on > >st_mode in order to color executable files differently. > > Right, and only executables. > > You can easily leave out the :ex=*** part of LS_COLORS. Tell that to every distro maintainer, and/or try to convince the upstream "ls" maintainers to change this. :-) > I don't think it's useful to introduce a new system call just to have > this support. It isn't just to fix the ls --color problem. There are lots of other apps that need some stat fields and not others. Also, implementing the compatibility support for this (statlite->stat(), flags=$all_valid) is trivial, if potentially less performant (though no worse than today). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
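The compatibility fallback mentioned above is easy to sketch. The struct, field, and flag names below are invented for illustration and are not the HECEWG man-page definitions; the point is only that a statlite-style call can be emulated with one ordinary stat() that marks every field valid.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

#define SL_SIZE_VALID   0x0001
#define SL_MTIME_VALID  0x0002
#define SL_UID_VALID    0x0004
#define SL_ALL_VALID    (SL_SIZE_VALID | SL_MTIME_VALID | SL_UID_VALID)

struct statlite_compat {
    unsigned int sl_valid;   /* in: fields the caller needs; out: fields actually filled */
    mode_t       sl_mode;    /* "required" field in this sketch, always filled           */
    uid_t        sl_uid;
    off_t        sl_size;
    time_t       sl_mtime;
};

/* Fallback for filesystems that don't care: one real stat(), everything valid. */
static int statlite_compat(const char *path, struct statlite_compat *sl)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    sl->sl_mode  = st.st_mode;
    sl->sl_uid   = st.st_uid;
    sl->sl_size  = st.st_size;
    sl->sl_mtime = st.st_mtime;
    sl->sl_valid = SL_ALL_VALID;   /* returning extra valid fields is harmless */
    return 0;
}

int main(void)
{
    /* an "ls -s"-style caller asks only for the size */
    struct statlite_compat sl = { .sl_valid = SL_SIZE_VALID };
    if (statlite_compat("/etc/hosts", &sl) == 0 && (sl.sl_valid & SL_SIZE_VALID))
        printf("size=%lld\n", (long long)sl.sl_size);
    return 0;
}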
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-16 18:13 ` Andreas Dilger @ 2006-12-16 19:08 ` Ulrich Drepper 0 siblings, 0 replies; 124+ messages in thread From: Ulrich Drepper @ 2006-12-16 19:08 UTC (permalink / raw) To: Ulrich Drepper, Rob Ross, Christoph Hellwig, Trond Myklebust, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Andreas Dilger wrote: > The kernel doesn't necessarily have to clear the fields. The per-field > valid flag would determine is that field had valid data or garbage. You cannot leak kernel memory content. Either you clear the field or, in the code which actually copies the data to userlevel, you copy again field by field. The latter is far too slow. So you better clear all fields. >> You can easily leave out the :ex=*** part of LS_COLORS. > > Tell that to every distro maintainer, and/or try to convince the upstream > "ls" maintainers to change this. :-) Why? Tell this to people who are affected. > It isn't just to fix the ls --color problem. There are lots of other > apps that need some stat fields and not others. Name them. I've asked for it before and got the answer "it's mainly ls". Now ls is debunked. So, provide more evidence that the getdirentplus support is needed. > Also, implementing > the compatibility support for this (statlite->stat(), flags=$all_valid) > is trivial, if potentially less performant (though no worse than today). We're not talking about statlite. The ls case is about getdirentplus. I fail to see evidence that it is really needed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 124+ messages in thread
* statlite() 2006-12-05 23:55 ` Ulrich Drepper 2006-12-06 10:06 ` Andreas Dilger @ 2006-12-14 23:58 ` Rob Ross 1 sibling, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-14 23:58 UTC (permalink / raw) To: Ulrich Drepper Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel We're going to clean the statlite() call up based on this (and subsequent) discussion and post again. Thanks! Rob Ulrich Drepper wrote: > Christoph Hellwig wrote: >> Ulrich, this in reply to these API proposals: > > I know the documents. The HECWG was actually supposed to submit an > actual draft to the OpenGroup-internal working group but I haven't seen > anything yet. I'm not opposed to getting real-world experience first. > > >>> So other than this "lite" version of the readdirplus() call, and this >>> idea of making the flags indicate validity rather than accuracy, are >>> there other comments on the directory-related calls? I understand >>> that they might or might not ever make it in, but assuming they did, >>> what other changes would you like to see? > > I don't think an accuracy flag is useful at all. Programs don't want to > use fuzzy information. If you want a fast 'ls -l' then add a mode which > doesn't print the fields which are not provided. Don't provide outdated > information. Similarly for other programs. > > >> statlite needs to separate the flag for valid fields from the actual >> stat structure and reuse the existing stat(64) structure. stat lite >> needs to at least get a better name, even better be folded into *statat*, >> either by having a new AT_VALID_MASK flag that enables a new >> unsigned int valid argument or by folding the valid flags into the AT_ >> flags. > > Yes, this is also my pet peeve with this interface. I don't want to > have another data structure. Especially since programs might want to > store the value in places where normal stat results are returned. > > And also yes on 'statat'. I strongly suggest to define only a statat > variant. In the standards group I'll vehemently oppose the introduction > of yet another superfluous non-*at interface. > > As for reusing the existing statat interface and magically add another > parameter through ellipsis: no. We need to become more type-safe. The > userlevel interface needs to be a new one. For the system call there is > no such restriction. We can indeed extend the existing syscall. We > have appropriate checks for the validity of the flags parameter in place > which make such calls backward compatible. > > > >> I think having a stat lite variant is pretty much consensus, we just need >> to fine tune the actual API - and of course get a reference >> implementation. >> So if you want to get this going try to implement it based on >> http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2. >> Bonus points for actually making use of the flags in some filesystems. > > I don't like that approach. The flag parameter should be exclusively an > output parameter. By default the kernel should fill in all the fields > it has access to. If access is not easily possible then set the bit and > clear the field. There are of course certain fields which always should > be added. In the proposed man page these are already identified (i.e., > those before the st_litemask member). 
> > >> At the actual >> C prototype level I would rename d_stat_err to d_stat_errno for >> consistency >> and maybe drop the readdirplus() entry point in favour of readdirplus_r >> only - there is no point in introducing new non-reenetrant APIs today. > ^ permalink raw reply [flat|nested] 124+ messages in thread
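To keep the structure under discussion in front of the reader, here is a strawman reconstructed from the names used in this thread (d_stat, d_stat_errno, readdirplus_r), together with the obvious userspace emulation; it is illustrative only, not the published proposal text.

#define _GNU_SOURCE
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

struct dirent_plus_example {
    struct dirent d_dirent;     /* the usual name/ino/type entry        */
    struct stat   d_stat;       /* attributes returned with the entry   */
    int           d_stat_errno; /* 0 if d_stat is valid, else an errno  */
};

/* Re-entrant form analogous to readdir_r(): the caller supplies the buffer.
 * Emulated here with readdir_r() + fstatat(); a filesystem supporting the
 * call natively would fill d_stat from the directory read itself. */
static int readdirplus_r_example(DIR *dirp, struct dirent_plus_example *ent,
                                 struct dirent_plus_example **result)
{
    struct dirent *de;
    int err = readdir_r(dirp, &ent->d_dirent, &de);

    if (err != 0 || de == NULL) {
        *result = NULL;
        return err;
    }
    ent->d_stat_errno = 0;
    if (fstatat(dirfd(dirp), ent->d_dirent.d_name,
                &ent->d_stat, AT_SYMLINK_NOFOLLOW) != 0)
        ent->d_stat_errno = errno;
    *result = ent;
    return 0;
}

int main(void)
{
    DIR *dir = opendir(".");
    struct dirent_plus_example ent, *res;

    while (dir && readdirplus_r_example(dir, &ent, &res) == 0 && res != NULL)
        printf("%-20s %s\n", ent.d_dirent.d_name,
               ent.d_stat_errno ? "(stat failed)" : "ok");
    if (dir)
        closedir(dir);
    return 0;
}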
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 22:05 ` Christoph Hellwig 2006-12-05 23:18 ` Sage Weil 2006-12-05 23:55 ` Ulrich Drepper @ 2006-12-07 23:39 ` Nikita Danilov 2 siblings, 0 replies; 124+ messages in thread From: Nikita Danilov @ 2006-12-07 23:39 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Andreas Dilger, Sage Weil, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel, drepper Christoph Hellwig writes: > I'd like to Cc Ulrich Drepper in this thread because he's going to decide > what APIs will be exposed at the C library level in the end, and he also > has quite a lot of experience with the various standardization bodies. > > Ulrich, this in reply to these API proposals: > > http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf > http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf What is readdirplus() supposed to return in the ->d_stat field for a name "foo" in directory "bar" when "bar/foo" is a mount-point? Note that in the case of a distributed file system, the server has no idea about client mount-points, which implies some form of local post-processing. Nikita. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 0:59 ` Rob Ross 2006-12-05 4:44 ` Gary Grider 2006-12-05 5:56 ` Trond Myklebust @ 2006-12-05 14:37 ` Peter Staubach 2 siblings, 0 replies; 124+ messages in thread From: Peter Staubach @ 2006-12-05 14:37 UTC (permalink / raw) To: Rob Ross Cc: Trond Myklebust, Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Rob Ross wrote: > > Regarding Peter Staubach's comments about no one ever using the > readdirplus() call; well, if people weren't performing this workload > in the first place, we wouldn't *need* this sort of call! This call is > specifically targeted at improving "ls -l" performance on large > directories, and Sage has pointed out quite nicely how that might work. I don't think that anyone has shown a *need* for this sort of call yet, actually. What application would actually benefit from this call and where are the measurements? Simply asserting that "ls -l" will benefit is not enough without some measurements. Or mention a different real world application... Having developed and prototyped the NFSv3 READDIRPLUS, I can tell you that the wins were less than expected/hoped for and while it wasn't all that hard to implement in a simple way, doing so in a high performance fashion is much harder. Many implementations that I have heard about turn off READDIRPLUS when dealing with a large directory. Caching is what makes things fast and caching means avoiding going over the network. ps ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-04 15:15 ` Trond Myklebust 2006-12-05 0:59 ` Rob Ross @ 2006-12-05 10:26 ` Andreas Dilger 2006-12-05 15:23 ` Trond Myklebust 2006-12-05 17:06 ` Latchesar Ionkov 1 sibling, 2 replies; 124+ messages in thread From: Andreas Dilger @ 2006-12-05 10:26 UTC (permalink / raw) To: Trond Myklebust Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 04, 2006 10:15 -0500, Trond Myklebust wrote: > I propose that we implement this sort of thing in the kernel via a readdir > equivalent to posix_fadvise(). That can give exactly the barrier > semantics that they are asking for, and only costs 1 extra syscall as > opposed to 2 (opendirplus() and readdirplus()). I think the "barrier semantics" are something that have just crept into this discussion and is confusing the issue. The primary goal (IMHO) of this syscall is to allow the filesystem (primarily distributed cluster filesystems, but HFS and NTFS developers seem on board with this too) to avoid tens to thousands of stat RPCs in very common ls -R, find, etc. kind of operations. I can't see how fadvise() could help this case? Yes, it would tell the filesystem that it could do readahead of the readdir() data, but the app will still be doing stat() on each of the thousands of files in the directory, instantiating inodes and dentries on that node (which need locking, and potentially immediate lock revocation if the files are being written to by other nodes). In some cases (e.g. rm -r, grep -r) that might even be a win, because the client will soon be touching all of those files, but not necessarily in the ls -lR, find cases. The filesystem can't always do "stat-ahead" on the files because that requires instantiating an inode on the client which may be stale (lock revoked) by the time the app gets to it, and the app (and the VFS) have no idea just how stale it is, and whether the stat is a "real" stat or "only" the readdir stat (because the fadvise would only be useful on the directory, and not all of the child entries), so it would need to re-stat the file. Also, this would potentially blow the client's real working set of inodes out of cache. Doing things en-masse with readdirplus() also allows the filesystem to do the stat() operations in parallel internally (which is a net win if there are many servers involved) instead of serially as the application would do. Cheers, Andreas PS - I changed the topic to separate this from the openfh() thread. -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-05 10:26 ` readdirplus() as possible POSIX I/O API Andreas Dilger @ 2006-12-05 15:23 ` Trond Myklebust 2006-12-06 10:28 ` Andreas Dilger 2006-12-05 17:06 ` Latchesar Ionkov 1 sibling, 1 reply; 124+ messages in thread From: Trond Myklebust @ 2006-12-05 15:23 UTC (permalink / raw) To: Andreas Dilger Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote: > On Dec 04, 2006 10:15 -0500, Trond Myklebust wrote: > > I propose that we implement this sort of thing in the kernel via a readdir > > equivalent to posix_fadvise(). That can give exactly the barrier > > semantics that they are asking for, and only costs 1 extra syscall as > > opposed to 2 (opendirplus() and readdirplus()). > > I think the "barrier semantics" are something that have just crept > into this discussion and is confusing the issue. It is the _only_ concept that is of interest for something like NFS or CIFS. We already have the ability to cache the information. > The primary goal (IMHO) of this syscall is to allow the filesystem > (primarily distributed cluster filesystems, but HFS and NTFS developers > seem on board with this too) to avoid tens to thousands of stat RPCs in > very common ls -R, find, etc. kind of operations. > > I can't see how fadvise() could help this case? Yes, it would tell the > filesystem that it could do readahead of the readdir() data, but the > app will still be doing stat() on each of the thousands of files in the > directory, instantiating inodes and dentries on that node (which need > locking, and potentially immediate lock revocation if the files are > being written to by other nodes). In some cases (e.g. rm -r, grep -r) > that might even be a win, because the client will soon be touching all > of those files, but not necessarily in the ls -lR, find cases. 'find' should be quite happy with the existing readdir(). It does not need to use stat() or readdirplus() in order to recurse because readdir() provides d_type. The locking problem is only of interest to clustered filesystems. On local filesystems such as HFS, NTFS, and on networked filesystems like NFS or CIFS, the only lock that matters is the parent directory's inode->i_sem, which is held by readdir() anyway. If the application is able to select a statlite()-type of behaviour with the fadvise() hints, your filesystem could be told to serve up cached information instead of regrabbing locks. In fact that is a much more flexible scheme, since it also allows the filesystem to background the actual inode lookups, or to defer them altogether if that is more efficient. > The filesystem can't always do "stat-ahead" on the files because that > requires instantiating an inode on the client which may be stale (lock > revoked) by the time the app gets to it, and the app (and the VFS) have > no idea just how stale it is, and whether the stat is a "real" stat or > "only" the readdir stat (because the fadvise would only be useful on > the directory, and not all of the child entries), so it would need to > re-stat the file. Then provide hints that allow the app to select which behaviour it prefers. Most (all?) apps don't _care_, and so would be quite happy with cached information. That is why the current NFS caching model exists in the first place. > Also, this would potentially blow the client's real > working set of inodes out of cache. Why? 
> Doing things en-masse with readdirplus() also allows the filesystem to > do the stat() operations in parallel internally (which is a net win if > there are many servers involved) instead of serially as the application > would do. If your application really cared, it could add threading to 'ls' to achieve the same result. You can also have the filesystem preload that information based on fadvise hints. Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
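A strawman of the hint-based alternative argued for above, purely to make the two positions concrete: the application marks the open directory descriptor as tolerant of cached attributes and then keeps using the ordinary readdir()/fstatat() sequence. Both FADV_DIR_CACHED_OK and fadvise_dir() are invented names used only for illustration; no such interface exists.

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

#define FADV_DIR_CACHED_OK 1    /* hypothetical: cached attribute data is acceptable */

/* hypothetical syscall wrapper; declared only so the sketch is self-contained */
extern int fadvise_dir(int dirfd, int advice);

int list_with_hint(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    /* "ls" doesn't need fresh locks, so say so once, up front */
    fadvise_dir(dirfd(dir), FADV_DIR_CACHED_OK);

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        struct stat st;
        /* the filesystem may now serve these from readdir-time cached data */
        fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW);
    }
    closedir(dir);
    return 0;
}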
* Re: readdirplus() as possible POSIX I/O API 2006-12-05 15:23 ` Trond Myklebust @ 2006-12-06 10:28 ` Andreas Dilger 2006-12-06 15:10 ` Trond Myklebust 0 siblings, 1 reply; 124+ messages in thread From: Andreas Dilger @ 2006-12-06 10:28 UTC (permalink / raw) To: Trond Myklebust Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote: > On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote: > > I think the "barrier semantics" are something that have just crept > > into this discussion and is confusing the issue. > > It is the _only_ concept that is of interest for something like NFS or > CIFS. We already have the ability to cache the information. Actually, wouldn't the ability for readdirplus() (with valid flag) be useful for NFS if only to indicate that it does not need to flush the cache because the ctime/mtime isn't needed by the caller? > 'find' should be quite happy with the existing readdir(). It does not > need to use stat() or readdirplus() in order to recurse because > readdir() provides d_type. It does in any but the most simplistic invocations, like "find -mtime" or "find -mode" or "find -uid", etc. > If the application is able to select a statlite()-type of behaviour with > the fadvise() hints, your filesystem could be told to serve up cached > information instead of regrabbing locks. In fact that is a much more > flexible scheme, since it also allows the filesystem to background the > actual inode lookups, or to defer them altogether if that is more > efficient. I guess I just don't understand how fadvise() on a directory file handle (used for readdir()) can be used to affect later stat operations (which definitely will NOT be using that file handle)? If you mean that the application should actually open() each file, fadvise(), fstat(), close(), instead of just a stat() call then we are WAY into negative improvements here due to overhead of doing open+close. > > The filesystem can't always do "stat-ahead" on the files because that > > requires instantiating an inode on the client which may be stale (lock > > revoked) by the time the app gets to it, and the app (and the VFS) have > > no idea just how stale it is, and whether the stat is a "real" stat or > > "only" the readdir stat (because the fadvise would only be useful on > > the directory, and not all of the child entries), so it would need to > > re-stat the file. > > Then provide hints that allow the app to select which behaviour it > prefers. Most (all?) apps don't _care_, and so would be quite happy with > cached information. That is why the current NFS caching model exists in > the first place. Most clustered filesystems have strong cache semantics, so that isn't a problem. IMHO, the mechanism to pass the hint to the filesystem IS the readdirplus_lite() that tells the filesystem exactly which data is needed on each directory entry. > > Also, this would potentially blow the client's real > > working set of inodes out of cache. > > Why? Because in many cases it is desirable to limit the number of DLM locks on a given client (e.g. GFS2 thread with AKPM about clients with millions of DLM locks due to lack of memory pressure on large mem systems). That means a finite-size lock LRU on the client that risks being wiped out by a few thousand files in a directory doing "readdir() + 5000*stat()". Consider a system like BlueGene/L with 128k compute cores. Jobs that run on that system will periodically (e.g. 
every hour) create up to 128K checkpoint+restart files to avoid losing a lot of computation if a node crashes. Even if each one of the checkpoints is in a separate directory (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing an "ls" in the directory. > > Doing things en-masse with readdirplus() also allows the filesystem to > > do the stat() operations in parallel internally (which is a net win if > > there are many servers involved) instead of serially as the application > > would do. > > If your application really cared, it could add threading to 'ls' to > achieve the same result. You can also have the filesystem preload that > information based on fadvise hints. But it would still need 128K RPCs to get that information, and 128K new inodes on that client. And what is the chance that I can get a multi-threading "ls" into the upstream GNU ls code? In the case of local filesystems multi-threading ls would be a net loss due to seeking. But even for local filesystems readdirplus_lite() would allow them to fill in stat information they already have (either in cache or on disk), and may avoid doing extra work that isn't needed. For filesystems that don't care, readdirplus_lite() can just be readdir()+stat() internally. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-06 10:28 ` Andreas Dilger @ 2006-12-06 15:10 ` Trond Myklebust 0 siblings, 0 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-06 15:10 UTC (permalink / raw) To: Andreas Dilger Cc: Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On Wed, 2006-12-06 at 03:28 -0700, Andreas Dilger wrote: > On Dec 05, 2006 10:23 -0500, Trond Myklebust wrote: > > On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote: > > > I think the "barrier semantics" are something that have just crept > > > into this discussion and is confusing the issue. > > > > It is the _only_ concept that is of interest for something like NFS or > > CIFS. We already have the ability to cache the information. > > Actually, wouldn't the ability for readdirplus() (with valid flag) be > useful for NFS if only to indicate that it does not need to flush the > cache because the ctime/mtime isn't needed by the caller? That is why statlite() might be useful. I'd prefer something more generic, though. > > 'find' should be quite happy with the existing readdir(). It does not > > need to use stat() or readdirplus() in order to recurse because > > readdir() provides d_type. > > It does in any but the most simplistic invocations, like "find -mtime" > or "find -mode" or "find -uid", etc. The only 'win' a readdirplus may give you there as far as NFS is concerned is the sysenter overhead that you would have for calling stat() (or statlite() many times). > > If the application is able to select a statlite()-type of behaviour with > > the fadvise() hints, your filesystem could be told to serve up cached > > information instead of regrabbing locks. In fact that is a much more > > flexible scheme, since it also allows the filesystem to background the > > actual inode lookups, or to defer them altogether if that is more > > efficient. > > I guess I just don't understand how fadvise() on a directory file handle > (used for readdir()) can be used to affect later stat operations (which > definitely will NOT be using that file handle)? If you mean that the > application should actually open() each file, fadvise(), fstat(), close(), > instead of just a stat() call then we are WAY into negative improvements > here due to overhead of doing open+close. On the contrary, the readdir descriptor is used in all those funky new statat(), calls. Ditto for readlinkat(), faccessat(). You could even have openat() turn off the close-to-open GETATTR if the readdir descriptor contained a hint that told it that was unnecessary. Furthermore, since the fadvise-like caching operation works on filehandles, you could have it work both on readdir() for the benefit of the above *at() calls, and also on the regular file descriptor for the benefit of fstat(), fgetxattr(). > > > The filesystem can't always do "stat-ahead" on the files because that > > > requires instantiating an inode on the client which may be stale (lock > > > revoked) by the time the app gets to it, and the app (and the VFS) have > > > no idea just how stale it is, and whether the stat is a "real" stat or > > > "only" the readdir stat (because the fadvise would only be useful on > > > the directory, and not all of the child entries), so it would need to > > > re-stat the file. > > > > Then provide hints that allow the app to select which behaviour it > > prefers. Most (all?) apps don't _care_, and so would be quite happy with > > cached information. 
That is why the current NFS caching model exists in > > the first place. > > Most clustered filesystems have strong cache semantics, so that isn't > a problem. IMHO, the mechanism to pass the hint to the filesystem IS > the readdirplus_lite() that tells the filesystem exactly which data is > needed on each directory entry. > > > > Also, this would potentially blow the client's real > > > working set of inodes out of cache. > > > > Why? > > Because in many cases it is desirable to limit the number of DLM locks > on a given client (e.g. GFS2 thread with AKPM about clients with > millions of DLM locks due to lack of memory pressure on large mem systems). > That means a finite-size lock LRU on the client that risks being wiped > out by a few thousand files in a directory doing "readdir() + 5000*stat()". > > > Consider a system like BlueGene/L with 128k compute cores. Jobs that > run on that system will periodically (e.g. every hour) create up to 128K > checkpoint+restart files to avoid losing a lot of computation if a node > crashes. Even if each one of the checkpoints is in a separate directory > (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing > an "ls" in the directory. That is precisely the sort of situation where knowing when you can cache, and when you cannot would be a plus. An ls call may not need 128k dlm locks, because it only cares about the state of the inodes as they were at the start of the opendir() call. > > > Doing things en-masse with readdirplus() also allows the filesystem to > > > do the stat() operations in parallel internally (which is a net win if > > > there are many servers involved) instead of serially as the application > > > would do. > > > > If your application really cared, it could add threading to 'ls' to > > achieve the same result. You can also have the filesystem preload that > > information based on fadvise hints. > > But it would still need 128K RPCs to get that information, and 128K new > inodes on that client. And what is the chance that I can get a > multi-threading "ls" into the upstream GNU ls code? In the case of local > filesystems multi-threading ls would be a net loss due to seeking. NFS doesn't 'cos it implements readdirplus under the covers as far as userland is concerned. > But even for local filesystems readdirplus_lite() would allow them to > fill in stat information they already have (either in cache or on disk), > and may avoid doing extra work that isn't needed. For filesystems that > don't care, readdirplus_lite() can just be readdir()+stat() internally. The thing to note, though, is that in the NFS implementation we are _very_ careful about use the GETATTR information it returns if there is already an inode instantiated for that dentry. This is precisely because we don't want to deal with the issue of synchronisation w.r.t. an inode that may be under writeout, that may be the subject of setattr() calls, etc. As far as we're concerned, READDIRPLUS is a form of mass LOOKUP, not a mass inode revalidation Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: readdirplus() as possible POSIX I/O API 2006-12-05 10:26 ` readdirplus() as possible POSIX I/O API Andreas Dilger 2006-12-05 15:23 ` Trond Myklebust @ 2006-12-05 17:06 ` Latchesar Ionkov 2006-12-05 22:48 ` Rob Ross 1 sibling, 1 reply; 124+ messages in thread From: Latchesar Ionkov @ 2006-12-05 17:06 UTC (permalink / raw) To: Trond Myklebust, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel On 12/5/06, Andreas Dilger <adilger@clusterfs.com> wrote: > The primary goal (IMHO) of this syscall is to allow the filesystem > (primarily distributed cluster filesystems, but HFS and NTFS developers > seem on board with this too) to avoid tens to thousands of stat RPCs in > very common ls -R, find, etc. kind of operations. I don't think that ls -R and find are such common cases that they need the introduction of new operations in order to make them faster. On the other hand, maybe they are often being used to do microbenchmarks. If your goal is to make these filesystems look faster on microbenchmarks, then you probably have the right solution. For normal use, especially on clusters, I don't see any advantage of doing that. Thanks, Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: readdirplus() as possible POSIX I/O API 2006-12-05 17:06 ` Latchesar Ionkov @ 2006-12-05 22:48 ` Rob Ross 0 siblings, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-05 22:48 UTC (permalink / raw) To: Latchesar Ionkov Cc: Trond Myklebust, Sage Weil, Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Gary Grider, linux-fsdevel Hi Lucho, Andreas is right on mark. The problem here is that when one user kicks off an ls -l or ls -R on a cluster file system *while other users are trying to get work done*, all those stat RPCs and lock reclamations can kill performance. We're not interested in a "ls -lR" top 500, we're interested in making systems more usable, more tolerant to everyday user behaviors. Regards, Rob Latchesar Ionkov wrote: > On 12/5/06, Andreas Dilger <adilger@clusterfs.com> wrote: >> The primary goal (IMHO) of this syscall is to allow the filesystem >> (primarily distributed cluster filesystems, but HFS and NTFS developers >> seem on board with this too) to avoid tens to thousands of stat RPCs in >> very common ls -R, find, etc. kind of operations. > > I don't think that ls -R and find are that common cases that they need > introduction of new operations in order to made them faster. On the > other hand may be they are often being used to do microbenchmarks. If > you goal is to make these filesystems look faster on microbenchmarks, > then probably you have the right solution. For normal use, especially > on clusters, I don't see any advantage of doing that. > > Thanks, > Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 9:48 ` Andreas Dilger 2006-11-29 10:18 ` Anton Altaparmakov @ 2006-11-29 10:25 ` Steven Whitehouse 2006-11-30 12:29 ` Christoph Hellwig 2006-12-01 15:52 ` Ric Wheeler 2 siblings, 1 reply; 124+ messages in thread From: Steven Whitehouse @ 2006-11-29 10:25 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-fsdevel, Gary Grider, Christoph Hellwig Hi, On Wed, 2006-11-29 at 01:48 -0800, Andreas Dilger wrote: > On Nov 29, 2006 09:04 +0000, Christoph Hellwig wrote: > > - readdirplus > > > > This one is completely unneeded as a kernel API. Doing readdir > > plus calls on the wire makes a lot of sense and we already do > > that for NFSv3+. Doing this at the syscall layer just means > > kernel bloat - syscalls are very cheap. > > The question is how does the filesystem know that the application is > going to do readdir + stat every file? It has to do this as a heuristic > implemented in the filesystem to determine if the ->getattr() calls match > the ->readdir() order. If the application knows that it is going to be > doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem > take advantage of this information? If combined with the statlite > interface, it can make a huge difference for clustered filesystems. > > Cheers, Andreas I agree that this is a good plan, but I'd been looking at this idea from a different direction recently. The in-kernel NFS server calls vfs_getattr from its filldir routine for readdirplus, and this means not only are we unable to optimise performance by (for example) sorting groups of getattr calls so that we read the inodes in disk block order, but also that it's effectively enforcing a locking order of the inodes on us too. Since we can have async locking in GFS2, we should be able to do "lockahead" with readdirplus too. I had been considering proposing a readdirplus export operation, but since this thread has come up, perhaps a file operation would be preferable as it could solve two problems with one operation? Steve. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 10:25 ` NFSv4/pNFS possible POSIX I/O API standards Steven Whitehouse @ 2006-11-30 12:29 ` Christoph Hellwig 0 siblings, 0 replies; 124+ messages in thread From: Christoph Hellwig @ 2006-11-30 12:29 UTC (permalink / raw) To: Steven Whitehouse Cc: Andreas Dilger, linux-fsdevel, Gary Grider, Christoph Hellwig On Wed, Nov 29, 2006 at 10:25:07AM +0000, Steven Whitehouse wrote: > I agree that this is a good plan, but I'd been looking at this idea from > a different direction recently. The in kernel NFS server calls > vfs_getattr from its filldir routine for readdirplus and this means not > only are we unable to optimise performance by (for example) sorting > groups of getattr calls so that we read the inodes in disk block order, > but also that its effectively enforcing a locking order of the inodes on > us too. Since we can have async locking in GFS2, we should be able to do > "lockahead" with readdirplus too. > > I had been considering proposing a readdirplus export operation, but > since this thread has come up, perhaps a file operation would be > preferable as it could solve two problems with one operation? Doing this as an export operation is wrong. Even if it's only used for nfsd for now, the logical level this should be on is the file operations. If you do it you could probably prototype a syscall for it as well - once we have the infrastructure the syscall should be no more than about 20 lines of code. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 9:48 ` Andreas Dilger 2006-11-29 10:18 ` Anton Altaparmakov 2006-11-29 10:25 ` NFSv4/pNFS possible POSIX I/O API standards Steven Whitehouse @ 2006-12-01 15:52 ` Ric Wheeler 2 siblings, 0 replies; 124+ messages in thread From: Ric Wheeler @ 2006-12-01 15:52 UTC (permalink / raw) To: Christoph Hellwig, Gary Grider, linux-fsdevel Andreas Dilger wrote: > On Nov 29, 2006 09:04 +0000, Christoph Hellwig wrote: >> - readdirplus >> >> This one is completely unneeded as a kernel API. Doing readdir >> plus calls on the wire makes a lot of sense and we already do >> that for NFSv3+. Doing this at the syscall layer just means >> kernel bloat - syscalls are very cheap. > > The question is how does the filesystem know that the application is > going to do readdir + stat every file? It has to do this as a heuristic > implemented in the filesystem to determine if the ->getattr() calls match > the ->readdir() order. If the application knows that it is going to be > doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem > take advantage of this information? If combined with the statlite > interface, it can make a huge difference for clustered filesystems. > > I think that this kind of heuristic would be a win for local file systems with a huge number of files as well... ric ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 9:04 ` Christoph Hellwig 2006-11-29 9:14 ` Christoph Hellwig 2006-11-29 9:48 ` Andreas Dilger @ 2006-11-29 12:23 ` Matthew Wilcox 2006-11-29 12:35 ` Matthew Wilcox 2006-11-29 12:39 ` Christoph Hellwig 2 siblings, 2 replies; 124+ messages in thread From: Matthew Wilcox @ 2006-11-29 12:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Gary Grider, linux-fsdevel On Wed, Nov 29, 2006 at 09:04:50AM +0000, Christoph Hellwig wrote: > - openg/sutoc > > No way. We already have a very nice file descriptor abstraction. > You can pass file descriptors over unix sockets just fine. Yes, but it behaves like dup(). Gary replied to me off-list (which I didn't notice and continued replying to him off-list). I wrote: Is this for people who don't know about dup(), or do they need independent file offsets? If the latter, I think an xdup() would be preferable (would there be a security issue for OSes with revoke()?) Either that, or make the key be useful for something else. ^ permalink raw reply [flat|nested] 124+ messages in thread
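Since descriptor passing keeps coming up, here is the standard mechanism being referred to: an SCM_RIGHTS control message over an AF_UNIX socket, which hands a descriptor to another process on the same machine. The receiver shares the open file description (and hence the offset), which is exactly the dup()-like behaviour noted above. A self-contained sketch using a socketpair, with most error handling omitted and an arbitrary test file:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { &dummy, 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* the payload is a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { &dummy, 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    int fd = open("/etc/hosts", O_RDONLY);  /* any descriptor will do */
    send_fd(sv[0], fd);

    int fd2 = recv_fd(sv[1]);               /* refers to the same open file description */
    printf("original fd %d, received fd %d\n", fd, fd2);
    return 0;
}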
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 12:23 ` Matthew Wilcox @ 2006-11-29 12:35 ` Matthew Wilcox 2006-11-29 16:26 ` Gary Grider 2006-11-29 12:39 ` Christoph Hellwig 1 sibling, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-11-29 12:35 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Gary Grider, linux-fsdevel On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote: > On Wed, Nov 29, 2006 at 09:04:50AM +0000, Christoph Hellwig wrote: > > - openg/sutoc > > > > No way. We already have a very nice file descriptor abstraction. > > You can pass file descriptors over unix sockets just fine. > > Yes, but it behaves like dup(). Gary replied to me off-list (which I > didn't notice and continued replying to him off-list). I wrote: > > Is this for people who don't know about dup(), or do they need > independent file offsets? If the latter, I think an xdup() would be > preferable (would there be a security issue for OSes with revoke()?) > Either that, or make the key be useful for something else. I further wonder if these people would see appreciable gains from doing sutoc rather than doing openat(dirfd, "basename", flags, mode); If they did, I could also see openat being extended to allow dirfd to be a file fd, as long as pathname were NULL or a pointer to NUL. But with all the readx stuff being proposed, I bet they don't really need independent file offsets. That's, like, so *1970*s. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 12:35 ` Matthew Wilcox @ 2006-11-29 16:26 ` Gary Grider 2006-11-29 17:18 ` Christoph Hellwig 0 siblings, 1 reply; 124+ messages in thread From: Gary Grider @ 2006-11-29 16:26 UTC (permalink / raw) To: Matthew Wilcox, Christoph Hellwig; +Cc: linux-fsdevel At 05:35 AM 11/29/2006, Matthew Wilcox wrote: >On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote: > > On Wed, Nov 29, 2006 at 09:04:50AM +0000, Christoph Hellwig wrote: > > > - openg/sutoc > > > > > > No way. We already have a very nice file descriptor abstraction. > > > You can pass file descriptors over unix sockets just fine. > > > > Yes, but it behaves like dup(). Gary replied to me off-list (which I > > didn't notice and continued replying to him off-list). I wrote: > > > > Is this for people who don't know about dup(), or do they need > > independent file offsets? If the latter, I think an xdup() would be > > preferable (would there be a security issue for OSes with revoke()?) > > Either that, or make the key be useful for something else. There is a business case at the Open Group Web site. It is not a full use case document though. For a very tiny amount of background: it seems from the discussion that others (at least those working in clustered file systems) have seen the need for a statlite and readdir+ type function, whatever they might be called or however they might be implemented. As for openg, the gains have been seen in clustered file systems where you have 10s of thousands of processes spread out over thousands of machines. All 100k processes may open the same file and offset different amounts, sometimes strided and sometimes not, through the file. The opens all fire within a few milliseconds or less. This is a problem for large clustered file systems; open times have been seen in the minutes or worse. The writes all come at once as well quite often. Often they are complicated scatter-gather operations spread out across the entire distributed memory of thousands of machines, not even in a completely uniform manner. A little knowledge about the intent of the application goes a long way when you are dealing with 100k parallelism. Additionally, having some notion of groups of processes collaborating at the file system level is useful for trying to make informed decisions about determinism and quality of service you might want to provide, how strictly you want to enforce rules on collaborating processes, etc. As for NFS ACLs, this was going to be a separate extension volume, not associated with the performance portion. It comes up because many of the users of high end/clustered file system technology are often in secure environments and have need-to-know issues. We were trying to be helpful to the NFSv4 community, which has been kind enough to have these security features in their product. Additionally, this entire effort is being proposed as an extension, not as a change to the base POSIX I/O API. We certainly have no religion about how we make progress to assist the cluster file systems people and the NFSv4 people be better able to serve their communities, so all these comments are very welcome. Thanks Gary >I further wonder if these people would see appreciable gains from doing >sutoc rather than doing openat(dirfd, "basename", flags, mode); >If they did, I could also see openat being extended to allow dirfd to >be a file fd, as long as pathname were NULL or a pointer to NUL. 
> >But with all the readx stuff being proposed, I bet they don't really >need independent file offsets. That's, like, so *1970*s. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 16:26 ` Gary Grider @ 2006-11-29 17:18 ` Christoph Hellwig 0 siblings, 0 replies; 124+ messages in thread From: Christoph Hellwig @ 2006-11-29 17:18 UTC (permalink / raw) To: Gary Grider; +Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel Please don't repeat the stupid marketroid speech. If you want this to go anywhere please get someone with an actual clue to talk to us instead of you. Thanks a lot. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-11-29 12:23 ` Matthew Wilcox 2006-11-29 12:35 ` Matthew Wilcox @ 2006-11-29 12:39 ` Christoph Hellwig 2006-12-01 22:29 ` Rob Ross 1 sibling, 1 reply; 124+ messages in thread From: Christoph Hellwig @ 2006-11-29 12:39 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Gary Grider, linux-fsdevel On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote: > Is this for people who don't know about dup(), or do they need > independent file offsets? If the latter, I think an xdup() would be > preferable (would there be a security issue for OSes with revoke()?) > Either that, or make the key be useful for something else. Not sharing the file offset means we need a separate file struct, at which point the only thing saved is doing a lookup at the time of opening the file. While a full pathname traversal can be quite costly, an open is not something you do all that often anyway. And if you really need to open/close files very often you can speed it up nicely by keeping a file descriptor on the parent directory open and using openat(). Anyway, enough of talking here. We really need a very good description of the use case people want this for, and the specific performance problems they see, to find a solution. And the solution definitely does not involve a second half-assed file handle type with unspecified lifetime rules :-) ^ permalink raw reply [flat|nested] 124+ messages in thread
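The openat() pattern referred to above, for completeness: keep one descriptor on the (possibly deep) parent directory and open children relative to it, so the long path is walked only once. The path and file names below are placeholders.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* walk the long path once ... (placeholder path) */
    int dfd = open("/deep/project/dir", O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return 1;

    /* ... then each per-file open resolves only the last component */
    char name[32];
    for (int i = 0; i < 1000; i++) {
        snprintf(name, sizeof(name), "chunk.%d", i);
        int fd = openat(dfd, name, O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    close(dfd);
    return 0;
}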
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-11-29 12:39 ` Christoph Hellwig
@ 2006-12-01 22:29 ` Rob Ross
  2006-12-02  2:35 ` Latchesar Ionkov
  0 siblings, 1 reply; 124+ messages in thread
From: Rob Ross @ 2006-12-01 22:29 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Matthew Wilcox, Gary Grider, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3708 bytes --]

Hi all,

The use model for openg() and openfh() (previously named sutoc()) is n processes spread across a large cluster simultaneously opening a file. The challenge is to avoid, to the greatest extent possible, incurring O(n) FS interactions. To do that we need to allow actions of one process to be reused by other processes on other OS instances.

The openg() call allows one process to perform name resolution, which is often the most expensive part of this use model. Because permission checking is also performed as part of the openg(), some file systems do not require additional communication between OS and FS at openfh(). External communication channels are used to pass the handle resulting from the openg() call out to processes on other nodes (e.g. MPI_Bcast). dup(), openat(), and UNIX sockets are not viable options in this model, because there are many OS instances, not just one. All the calls that are being discussed as part of the HEC extensions are being discussed in this context of multiple OS instances and cluster file systems.

Regarding the lifetime of the handle, there has been quite a bit of discussion about this. I believe that we most recently were thinking that there was an undefined lifetime for this, allowing servers to "forget" these values (as in the case where a server is restarted). Clients would need to perform the openg() again if they were to try to use an outdated handle, or simply fall back to a regular open(). This is not a problem in our use model.

I've attached a graph showing the time to use individual open() calls vs. the openg()/MPI_Bcast()/openfh() combination; it's a clear win for any significant number of processes. These results are from our colleagues at Sandia (Ruth Klundt et al.) with PVFS underneath, but I expect the trend to be similar for many cluster file systems.

Regarding trying to "force APIs using standardization" on you (Christoph's 11/29/2006 message), you've got us all wrong. The standardization process is going to take some time, so we're starting on it at the same time that we're working with prototypes, so that we don't have to wait any longer than necessary to have these things be part of POSIX.

The whole reason we're presenting this on this list is to try to describe why we think these calls are important and to get feedback on how we can make these calls work well in the context of Linux. I'm glad to see so many people taking interest. I look forward to further constructive discussion.

Thanks,

Rob
---
Rob Ross
Mathematics and Computer Science Division
Argonne National Laboratory

Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote:
>> Is this for people who don't know about dup(), or do they need
>> independent file offsets?  If the latter, I think an xdup() would be
>> preferable (would there be a security issue for OSes with revoke()?)
>> Either that, or make the key be useful for something else.
>
> Not sharing the file offset means we need a separate file struct, at
> which point the only thing saved is doing a lookup at the time of
> opening the file.
While a full pathname traversal can be quite costly > an open is not something you do all that often anyway. And if you really > need to open/close files very often you can speed it up nicely by keeping > a file descriptor on the parent directory open and use openat(). > > Anyway, enough of talking here. We really need a very good description > of the use case people want this for, and the specific performance problems > they see to find a solution. And the solution definitly does not involve > as second half-assed file handle time with unspecified lifetime rules :-) [-- Attachment #2: openg-compare.pdf --] [-- Type: application/pdf, Size: 20518 bytes --] ^ permalink raw reply [flat|nested] 124+ messages in thread
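A sketch of the usage pattern described above, as the proposal seems to intend it: one rank resolves the name, the opaque handle is broadcast out of band, and every rank converts it into a local file descriptor. The openg()/openfh() prototypes, the fixed-size fh_t, and the file name are assumptions for illustration; the real proposal leaves the handle encoding up to the file system.

    /* Sketch only: openg()/openfh() and the fh_t layout are assumed, not existing APIs. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define FH_MAX 128                        /* assumed upper bound on handle size */
    struct fh_t { char bytes[FH_MAX]; };      /* opaque handle, illustrative layout */

    int openg(const char *path, int flags, struct fh_t *fh);  /* proposed: resolve + check */
    int openfh(const struct fh_t *fh);                        /* proposed: handle -> fd    */

    int main(int argc, char **argv)
    {
        int rank;
        struct fh_t fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)                        /* one name resolution for the whole job */
            openg("/pfs/run/checkpoint.0042", O_RDONLY, &fh);

        /* Hand the opaque handle to the other ranks out of band. */
        MPI_Bcast(&fh, sizeof(fh), MPI_BYTE, 0, MPI_COMM_WORLD);

        int fd = openfh(&fh);                 /* no per-rank path traversal */
        if (fd >= 0) {
            /* ... per-rank I/O at its own offset ... */
            close(fd);
        }

        MPI_Finalize();
        return 0;
    }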
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-01 22:29 ` Rob Ross
@ 2006-12-02  2:35 ` Latchesar Ionkov
  2006-12-05  0:37 ` Rob Ross
  0 siblings, 1 reply; 124+ messages in thread
From: Latchesar Ionkov @ 2006-12-02  2:35 UTC (permalink / raw)
To: Rob Ross; +Cc: Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

Hi,

One general remark: I don't think it is feasible to add new system calls every time somebody has a problem. Usually there are solutions (maybe not that good) that don't require big changes and work well enough. "Let's change the interface and make the life of many filesystem developers miserable, because they have to worry about 3-4-5 more operations" is not the easiest solution in the long run.

On 12/1/06, Rob Ross <rross@mcs.anl.gov> wrote:
> Hi all,
>
> The use model for openg() and openfh() (renamed sutoc()) is n processes
> spread across a large cluster simultaneously opening a file. The
> challenge is to avoid to the greatest extent possible incurring O(n) FS
> interactions. To do that we need to allow actions of one process to be
> reused by other processes on other OS instances.
>
> The openg() call allows one process to perform name resolution, which is
> often the most expensive part of this use model. Because permission

If the name resolution is the most expensive part, why not implement just the name lookup part and call it "lookup" instead of "openg"? Or even better, make NFS resolve multiple names with a single request. If the NFS server caches the last few name lookups, the responses from the other nodes will be fast, and you will get your file descriptor with two requests instead of the proposed one. The performance could be just good enough without introducing any new functions and file handles.

Thanks,
Lucho

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-02  2:35 ` Latchesar Ionkov
@ 2006-12-05  0:37 ` Rob Ross
  2006-12-05 10:02 ` Christoph Hellwig
  2006-12-05 16:47 ` Latchesar Ionkov
  1 sibling, 2 replies; 124+ messages in thread
From: Rob Ross @ 2006-12-05  0:37 UTC (permalink / raw)
To: Latchesar Ionkov
Cc: Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

Hi,

I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make!

I'm sure that you meant more than just to rename openg() to lookup(), but I don't understand what you are proposing. We still need a second call to take the results of the lookup (by whatever name) and convert that into a file descriptor. That's all the openfh() (previously named sutoc()) is for.

I think the subject line might be a little misleading; we're not just talking about NFS here. There are a number of different file systems that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS, PanFS, etc.).

Finally, your comment on making filesystem developers miserable is sort of a point of philosophical debate for me. I personally find myself miserable trying to extract performance given the very small amount of information passing through the existing POSIX calls. The additional information passing through these new calls will make it much easier to obtain performance without having to correctly guess what the user might actually be up to. While they do mean more work in the short term, they should also mean a more straightforward path to performance for cluster/parallel file systems.

Thanks for the input. Does this help explain why we don't think we can just work under the existing calls?

Rob

Latchesar Ionkov wrote:
> Hi,
>
> One general remark: I don't think it is feasible to add new system
> calls every time somebody has a problem. Usually there are (may be not
> that good) solutions that don't require big changes and work well
> enough. "Let's change the interface and make the life of many
> filesystem developers miserable, because they have to worry about
> 3-4-5 more operations" is not the easiest solution in the long run.
>
> On 12/1/06, Rob Ross <rross@mcs.anl.gov> wrote:
>> Hi all,
>>
>> The use model for openg() and openfh() (renamed sutoc()) is n processes
>> spread across a large cluster simultaneously opening a file. The
>> challenge is to avoid to the greatest extent possible incurring O(n) FS
>> interactions. To do that we need to allow actions of one process to be
>> reused by other processes on other OS instances.
>>
>> The openg() call allows one process to perform name resolution, which is
>> often the most expensive part of this use model. Because permission
>
> If the name resolution is the most expensive part, why not implement
> just the name lookup part and call it "lookup" instead of "openg". Or
> even better, make NFS to resolve multiple names with a single request.
> If the NFS server caches the last few name lookups, the responses from
> the other nodes will be fast, and you will get your file descriptor
> with two instead of the proposed one request. The performance could be
> just good enough without introducing any new functions and file
> handles.
> > Thanks, > Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-05  0:37 ` Rob Ross
@ 2006-12-05 10:02 ` Christoph Hellwig
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2006-12-05 10:02 UTC (permalink / raw)
To: Rob Ross
Cc: Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

On Mon, Dec 04, 2006 at 06:37:38PM -0600, Rob Ross wrote:
> I think the subject line might be a little misleading; we're not just
> talking about NFS here. There are a number of different file systems
> that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS,
> PanFS, etc.).

Any support for advanced filesystem semantics will definitely not be available to proprietary filesystems like GPFS that violate our copyrights blatantly.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-05  0:37 ` Rob Ross
  2006-12-05 10:02 ` Christoph Hellwig
@ 2006-12-05 16:47 ` Latchesar Ionkov
  2006-12-05 17:01 ` Matthew Wilcox
  ` (2 more replies)
  1 sibling, 3 replies; 124+ messages in thread
From: Latchesar Ionkov @ 2006-12-05 16:47 UTC (permalink / raw)
To: Rob Ross; +Cc: Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
> Hi,
>
> I agree that it is not feasible to add new system calls every time
> somebody has a problem, and we don't take adding system calls lightly.
> However, in this case we're talking about an entire *community* of
> people (high-end computing), not just one or two people. Of course it
> may still be the case that that community is not important enough to
> justify the addition of system calls; that's obviously not my call to make!

I have the feeling that the openg stuff is rushed without looking into all the solutions that don't require changes to the current interface. I don't see any numbers showing where exactly the time is spent. Is opening too slow because of the number of requests that the file server suddenly has to respond to? Would having an operation that looks up multiple names instead of a single name be good enough? How much time is spent on opening the file once you have resolved the name?

> I'm sure that you meant more than just to rename openg() to lookup(),
> but I don't understand what you are proposing. We still need a second
> call to take the results of the lookup (by whatever name) and convert
> that into a file descriptor. That's all the openfh() (previously named
> sutoc()) is for.

The idea is that lookup doesn't open the file, it just does the name resolution. The actual opening is done by openfh (or whatever you call it next :). I don't think it is a good idea to introduce another way of addressing files on the file system at all, but if you still decide to do it, it makes more sense to separate the name resolution from the operations you want to do on the file (at the moment only the open operation, but who knows what somebody will think of next ;).

> I think the subject line might be a little misleading; we're not just
> talking about NFS here. There are a number of different file systems
> that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS,
> PanFS, etc.).

I think that the main problem is that all these file systems resolve a path name one directory at a time, bringing the server to its knees with the huge number of requests. I would like to see what the performance is if you a) cache the last few hundred lookups on the server side, and b) modify VFS and the file systems to support multi-name lookups. Just assume for a moment that there is no way to get these new operations in (which is probably going to be true anyway :). What other solutions can you think of? :)

Thanks,
Lucho

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 16:47 ` Latchesar Ionkov @ 2006-12-05 17:01 ` Matthew Wilcox [not found] ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com> 2006-12-05 21:50 ` Peter Staubach 2006-12-05 21:44 ` Rob Ross 2006-12-06 9:48 ` David Chinner 2 siblings, 2 replies; 124+ messages in thread From: Matthew Wilcox @ 2006-12-05 17:01 UTC (permalink / raw) To: Latchesar Ionkov; +Cc: Rob Ross, Christoph Hellwig, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > I think that the main problem is that all these file systems resove a > path name, one directory at a time bringing the server to its knees by > the huge amount of requests. I would like to see what the performance > is if you a) cache the last few hundred lookups on the server side, > and b) modify VFS and the file systems to support multi-name lookups. > Just assume for a moment that there is no any way to get these new > operations in (which is probaly going to be true anyway :). What other > solutions can you think of? :) How exactly would you want a multi-name lookup to work? Are you saying that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if you find it, find share, if you find it, find misc, if you find it, find pci.ids"? That would be potentially very wasteful; consider mount points, symlinks and other such effects on the namespace. You could ask the server to do a lot of work which you then discard ... and that's not efficient. ^ permalink raw reply [flat|nested] 124+ messages in thread
[parent not found: <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com>]
* Re: Re: Re: NFSv4/pNFS possible POSIX I/O API standards [not found] ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com> @ 2006-12-05 17:10 ` Latchesar Ionkov 2006-12-05 17:39 ` Matthew Wilcox 1 sibling, 0 replies; 124+ messages in thread From: Latchesar Ionkov @ 2006-12-05 17:10 UTC (permalink / raw) To: linux-fsdevel ---------- Forwarded message ---------- From: Latchesar Ionkov <lionkov@lanl.gov> Date: Dec 5, 2006 6:09 PM Subject: Re: Re: Re: NFSv4/pNFS possible POSIX I/O API standards To: Matthew Wilcox <matthew@wil.cx> On 12/5/06, Matthew Wilcox <matthew@wil.cx> wrote: > On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > > I think that the main problem is that all these file systems resove a > > path name, one directory at a time bringing the server to its knees by > > the huge amount of requests. I would like to see what the performance > > is if you a) cache the last few hundred lookups on the server side, > > and b) modify VFS and the file systems to support multi-name lookups. > > Just assume for a moment that there is no any way to get these new > > operations in (which is probaly going to be true anyway :). What other > > solutions can you think of? :) > > How exactly would you want a multi-name lookup to work? Are you saying > that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if > you find it, find share, if you find it, find misc, if you find it, find > pci.ids"? That would be potentially very wasteful; consider mount > points, symlinks and other such effects on the namespace. You could ask > the server to do a lot of work which you then discard ... and that's not > efficient. It could be wasteful, but it could (most likely) also be useful. Name resolution is not that expensive on either side of the network. The latency introduced by the single-name lookups is :) Thanks, Lucho ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: Re: NFSv4/pNFS possible POSIX I/O API standards [not found] ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com> 2006-12-05 17:10 ` Latchesar Ionkov @ 2006-12-05 17:39 ` Matthew Wilcox 2006-12-05 21:55 ` Rob Ross 1 sibling, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-12-05 17:39 UTC (permalink / raw) To: Latchesar Ionkov; +Cc: linux-fsdevel On Tue, Dec 05, 2006 at 06:09:03PM +0100, Latchesar Ionkov wrote: > It could be wasteful, but it could (most likely) also be useful. Name > resolution is not that expensive on either side of the network. The > latency introduced by the single-name lookups is :) *is* latency the problem here? Last I heard, it was the intolerable load placed on the DLM by having clients bounce the read locks for each directory element all over the cluster. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 17:39 ` Matthew Wilcox @ 2006-12-05 21:55 ` Rob Ross 0 siblings, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-05 21:55 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Latchesar Ionkov, linux-fsdevel Matthew Wilcox wrote: > On Tue, Dec 05, 2006 at 06:09:03PM +0100, Latchesar Ionkov wrote: >> It could be wasteful, but it could (most likely) also be useful. Name >> resolution is not that expensive on either side of the network. The >> latency introduced by the single-name lookups is :) > > *is* latency the problem here? Last I heard, it was the intolerable > load placed on the DLM by having clients bounce the read locks for each > directory element all over the cluster. I think you're both right: it's either the time spent on all the actual lookups or the time involved in all the lock traffic, depending on FS and network of course. Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-05 17:01 ` Matthew Wilcox
  [not found] ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com>
@ 2006-12-05 21:50 ` Peter Staubach
  1 sibling, 0 replies; 124+ messages in thread
From: Peter Staubach @ 2006-12-05 21:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Latchesar Ionkov, Rob Ross, Christoph Hellwig, Gary Grider, linux-fsdevel

Matthew Wilcox wrote:
> On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
>
>> I think that the main problem is that all these file systems resove a
>> path name, one directory at a time bringing the server to its knees by
>> the huge amount of requests. I would like to see what the performance
>> is if you a) cache the last few hundred lookups on the server side,
>> and b) modify VFS and the file systems to support multi-name lookups.
>> Just assume for a moment that there is no any way to get these new
>> operations in (which is probaly going to be true anyway :). What other
>> solutions can you think of? :)
>>
>
> How exactly would you want a multi-name lookup to work?  Are you saying
> that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if
> you find it, find share, if you find it, find misc, if you find it, find
> pci.ids"?  That would be potentially very wasteful; consider mount
> points, symlinks and other such effects on the namespace.  You could ask
> the server to do a lot of work which you then discard ... and that's not
> efficient.

It could be inefficient, as pointed out, but defined right, it could greatly reduce the number of over-the-wire trips. The client can already tell from its own namespace when a submount may be encountered, so it knows not to utilize the multi-component pathname lookup facility. The requirements could state that the server stops when it encounters a non-directory/non-regular file node in the namespace. This sort of thing...

ps

^ permalink raw reply [flat|nested] 124+ messages in thread
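A rough sketch of the kind of batched lookup being discussed: the client sends as many components as it believes stay on one server and below any submount, and the server reports how far it got so the client can fall back to component-at-a-time resolution for the rest. Every name and type here is hypothetical; nothing like this exists in the VFS today.

    /* Hypothetical multi-component lookup interface, for illustration only. */
    #include <stddef.h>

    struct multi_lookup_req {
        const char **components;    /* e.g. { "usr", "share", "misc", "pci.ids" } */
        size_t       ncomponents;
    };

    struct multi_lookup_rep {
        size_t             nresolved;   /* how many components the server walked       */
        int                stop_reason; /* 0 = done; else symlink, submount, ENOENT... */
        unsigned long long fileid;      /* identifier of the last object reached       */
    };

    /*
     * One round trip resolves as much of the path as the server can;
     * the client continues one component at a time from nresolved onward.
     */
    int fs_multi_lookup(int server_conn, const struct multi_lookup_req *req,
                        struct multi_lookup_rep *rep);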
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-05 16:47 ` Latchesar Ionkov
  2006-12-05 17:01 ` Matthew Wilcox
@ 2006-12-05 21:44 ` Rob Ross
  2006-12-06 11:01 ` openg Christoph Hellwig
  2006-12-06 23:25 ` Re: NFSv4/pNFS possible POSIX I/O API standards Latchesar Ionkov
  2 siblings, 2 replies; 124+ messages in thread
From: Rob Ross @ 2006-12-05 21:44 UTC (permalink / raw)
To: Latchesar Ionkov
Cc: Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

Latchesar Ionkov wrote:
> On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
>>
>> I agree that it is not feasible to add new system calls every time
>> somebody has a problem, and we don't take adding system calls lightly.
>> However, in this case we're talking about an entire *community* of
>> people (high-end computing), not just one or two people. Of course it
>> may still be the case that that community is not important enough to
>> justify the addition of system calls; that's obviously not my call to
>> make!
>
> I have the feeling that openg stuff is rushed without looking into all
> solutions, that don't require changes to the current interface. I
> don't see any numbers showing where exactly the time is spent? Is
> opening too slow because of the number of requests that the file
> server suddently has to respond to? Does having an operation that
> looks up multiple names instead of a single name good enough? How much
> time is spent on opening the file once you have resolved the name?

Thanks for looking at the graph. To clarify the workload, we do not expect that application processes will be opening a large number of files all at once; that was just how the test was run to get a reasonable average value. So I don't think that something that looked up multiple file names would help for this case.

I unfortunately don't have data to show exactly where the time was spent, but it's a good guess that it is all the network traffic in the open() case.

>> I'm sure that you meant more than just to rename openg() to lookup(),
>> but I don't understand what you are proposing. We still need a second
>> call to take the results of the lookup (by whatever name) and convert
>> that into a file descriptor. That's all the openfh() (previously named
>> sutoc()) is for.
>
> The idea is that lookup doesn't open the file, just does to name
> resolution. The actual opening is done by openfh (or whatever you call
> it next :). I don't think it is a good idea to introduce another way
> of addressing files on the file system at all, but if you still decide
> to do it, it makes more sense to separate the name resolution from the
> operations (at the moment only open operation, but who knows what'll
> somebody think of next;) you want to do on the file.

I really think that we're saying the same thing here. I think of the open() call as doing two (maybe three) things. First, it performs name resolution and permission checking. Second, it creates the file descriptor that allows the user process to do subsequent I/O. Third, it creates a context for access, if the FS keeps track of "open" files (not all do).

The openg() really just does the lookup and permission checking. The openfh() creates the file descriptor and starts that context if the particular FS tracks that sort of thing.

>> I think the subject line might be a little misleading; we're not just
>> talking about NFS here. There are a number of different file systems
>> that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS,
>> PanFS, etc.).
>
> I think that the main problem is that all these file systems resove a
> path name, one directory at a time bringing the server to its knees by
> the huge amount of requests. I would like to see what the performance
> is if you a) cache the last few hundred lookups on the server side,
> and b) modify VFS and the file systems to support multi-name lookups.
> Just assume for a moment that there is no any way to get these new
> operations in (which is probaly going to be true anyway :). What other
> solutions can you think of? :)

Well, you've caught me. I don't want to cache the values, because I fundamentally believe that sharing state between clients and servers is braindead (to use Christoph's phrase) in systems of this scale (thousands to tens of thousands of clients). So I don't want locks, so I can't keep the cache consistent, ... So someone else will have to run the tests you propose :)...

Also, to address Christoph's snipe while we're here: I don't care one way or another whether the Linux community wants to help GPFS or not. I do care that I'm arguing for something that is useful to more than just my own pet project, and that was the point that I was trying to make. I'll be sure not to mention GPFS again.

What's the etiquette on changing subject lines here? It might be useful to separate the openg() etc. discussion from the readdirplus() etc. discussion.

Thanks again for the comments,

Rob

^ permalink raw reply [flat|nested] 124+ messages in thread
* openg
  2006-12-05 21:44 ` Rob Ross
@ 2006-12-06 11:01 ` Christoph Hellwig
  2006-12-06 15:41 ` openg Trond Myklebust
  2006-12-06 15:42 ` openg Rob Ross
  1 sibling, 2 replies; 124+ messages in thread
From: Christoph Hellwig @ 2006-12-06 11:01 UTC (permalink / raw)
To: Rob Ross
Cc: Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote:
> The openg() really just does the lookup and permission checking). The
> openfh() creates the file descriptor and starts that context if the
> particular FS tracks that sort of thing.

...

> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale
> (thousands to tens of thousands of clients). So I don't want locks, so I
> can't keep the cache consistent, ... So someone else will have to run
> the tests you propose :)...

Besides the whole ugliness, you miss a few points about the fundamental architecture of the unix filesystem permission model, unfortunately.

Say you want to look up a path /foo/bar/baz; then the access permission is based on the following things:

 - the credentials of the user.  Let's only take traditional uid/gid
   for this example, although credentials are much more complex these days
 - the kind of operation you want to perform
 - the access permission of the actual object the path points to (inode)
 - the lookup permission (x bit) for every object on the way to your object

In your proposal sutoc is a simple conversion operation, which means openg needs to perform all these access checks and encode them in the fh_t. That means an fh_t must fundamentally be an object that is kept in the kernel, aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. And because it needs kernel support, your fh_t is more or less equivalent to a file descriptor, with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it.

Note that somewhat similar open-by-filehandle APIs, like open by inode number as used by lustre or the XFS *_by_handle APIs, are privileged operations because of exactly this problem.

What, according to your mail, is the most important bit in this proposal is that you think the filehandles should be easily shared with other systems in a cluster. That fact is not mentioned in the actual proposal at all, and is in fact the hardest part because of the inherent statefulness of the API.

> What's the etiquette on changing subject lines here? It might be useful
> to separate the openg() etc. discussion from the readdirplus() etc.
> discussion.

Changing subject lines is fine.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg 2006-12-06 11:01 ` openg Christoph Hellwig @ 2006-12-06 15:41 ` Trond Myklebust 2006-12-06 15:42 ` openg Rob Ross 1 sibling, 0 replies; 124+ messages in thread From: Trond Myklebust @ 2006-12-06 15:41 UTC (permalink / raw) To: Christoph Hellwig Cc: Rob Ross, Latchesar Ionkov, Matthew Wilcox, Gary Grider, linux-fsdevel On Wed, 2006-12-06 at 11:01 +0000, Christoph Hellwig wrote: > Say you want to lookup a path /foo/bar/baz, then the access permission > is based on the following things: > > - the credentials of the user. let's only take traditional uid/gid > for this example although credentials are much more complex these > days > - the kind of operation you want to perform > - the access permission of the actual object the path points to (inode) > - the lookup permission (x bit) for every object on the way to you object - your private namespace particularities (submounts etc) Trond ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg
  2006-12-06 11:01 ` openg Christoph Hellwig
  2006-12-06 15:41 ` openg Trond Myklebust
@ 2006-12-06 15:42 ` Rob Ross
  2006-12-06 23:32 ` openg Christoph Hellwig
  1 sibling, 1 reply; 124+ messages in thread
From: Rob Ross @ 2006-12-06 15:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Latchesar Ionkov, Matthew Wilcox, Gary Grider, linux-fsdevel

Christoph Hellwig wrote:
> On Tue, Dec 05, 2006 at 03:44:31PM -0600, Rob Ross wrote:
>> The openg() really just does the lookup and permission checking). The
>> openfh() creates the file descriptor and starts that context if the
>> particular FS tracks that sort of thing.
>
> ...
>
>> Well you've caught me. I don't want to cache the values, because I
>> fundamentally believe that sharing state between clients and servers is
>> braindead (to use Christoph's phrase) in systems of this scale
>> (thousands to tens of thousands of clients). So I don't want locks, so I
>> can't keep the cache consistent, ... So someone else will have to run
>> the tests you propose :)...
>
> Besides the whole ugliness you miss a few points about the fundamental
> architecture of the unix filesystem permission model unfortunately.
>
> Say you want to lookup a path /foo/bar/baz, then the access permission
> is based on the following things:
>
>  - the credentials of the user.  let's only take traditional uid/gid
>    for this example although credentials are much more complex these
>    days
>  - the kind of operation you want to perform
>  - the access permission of the actual object the path points to (inode)
>  - the lookup permission (x bit) for every object on the way to you object
>
> In your proposal sutoc is a simple conversion operation, that means
> openg needs to perfom all these access checks and encodes them in the
> fh_t.

This is exactly right and is the intention of the call.

> That means an fh_t must fundamentally be an object that is kept
> in the kernel aka a capability as defined by Henry Levy.  This does imply
> you _do_ need to keep state.

The fh_t is indeed a type of capability. fh_t, properly protected, could be passed into user space and validated by the file system when presented back to the file system.

There is state here, clearly. I feel ok about that because we allow servers to forget that they handed out these fh_ts if they feel like it; there is no guaranteed lifetime in the current proposal. This allows servers to come and go without needing to persistently store these. Likewise, clients can forget them with no real penalty.

This approach is ok because of the use case. Because we expect the fh_t to be used relatively soon after its creation, servers will not need to hold onto these long before the openfh() is performed and we're back into a normal "everyone has a valid fd" use case.

> And because it needs kernel support you
> fh_t is more or less equivalent to a file descriptor with sutoc equivalent
> to a dup variant that really duplicates the backing object instead of just
> the userspace index into it.

Well, an FD has some additional state associated with it (position, etc.), but yes, there are definitely similarities to dup().

> Note somewhat similar open by filehandle APIs like oben by inode number
> as used by lustre or the XFS *_by_handle APIs are privilegued operations
> because of exactly this problem.

I'm not sure why a properly protected fh_t couldn't be passed back into user space and handed around, but I'm not a security expert. What am I missing?
> What according to your mail is the most important bit in this proposal is > that you thing the filehandles should be easily shared with other system > in a cluster. That fact is not mentioned in the actual proposal at all, > and is in fact that hardest part because of inherent statefulness of > the API. The documentation of the calls is complicated by the way POSIX calls are described. We need to have a second document describing use cases also available, so that we can avoid misunderstandings as best we can, get straight to the real issues. Sorry that document wasn't available. I think I've addressed the statefulness of the API above? >> What's the etiquette on changing subject lines here? It might be useful >> to separate the openg() etc. discussion from the readdirplus() etc. >> discussion. > > Changing subject lines is fine. Thanks. Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg 2006-12-06 15:42 ` openg Rob Ross @ 2006-12-06 23:32 ` Christoph Hellwig 2006-12-14 23:36 ` openg Rob Ross 0 siblings, 1 reply; 124+ messages in thread From: Christoph Hellwig @ 2006-12-06 23:32 UTC (permalink / raw) To: Rob Ross Cc: Christoph Hellwig, Latchesar Ionkov, Matthew Wilcox, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 09:42:47AM -0600, Rob Ross wrote: > The fh_t is indeed a type of capability. fh_t, properly protected, could > be passed into user space and validated by the file system when > presented back to the file system. Well, there's quite a lot of papers on how to implement properly secure capabilities. The only performant way to do it is to implement them in kernel space or with hardware support. As soon as you pass them to userspace the user can manipulate them, and doing a cheap enough verification is non-trivial (e.g. it doesn't buy you anything if you spent the time you previously spent for lookup roundtrip latency for some expensive cryptography) > There is state here, clearly. I feel ok about that because we allow > servers to forget that they handed out these fh_ts if they feel like it; > there is no guaranteed lifetime in the current proposal. This allows > servers to come and go without needing to persistently store these. > Likewise, clients can forget them with no real penalty. Objects without defined lifetime rules are not something we're very keen on. Particularly in userspace interface they will cause all kinds of trouble because people will expect the lifetime rules they get from their normal filesystems. > The documentation of the calls is complicated by the way POSIX calls are > described. We need to have a second document describing use cases also > available, so that we can avoid misunderstandings as best we can, get > straight to the real issues. Sorry that document wasn't available. The real problem is that you want to do something in a POSIX spec that is fundamentally out of scope. POSIX .1 deals with system interfaces on a single system. You want to specify semantics over multiple systems in a cluster. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg 2006-12-06 23:32 ` openg Christoph Hellwig @ 2006-12-14 23:36 ` Rob Ross 0 siblings, 0 replies; 124+ messages in thread From: Rob Ross @ 2006-12-14 23:36 UTC (permalink / raw) To: Christoph Hellwig Cc: Latchesar Ionkov, Matthew Wilcox, Gary Grider, linux-fsdevel Christoph Hellwig wrote: > On Wed, Dec 06, 2006 at 09:42:47AM -0600, Rob Ross wrote: >> The fh_t is indeed a type of capability. fh_t, properly protected, could >> be passed into user space and validated by the file system when >> presented back to the file system. > > Well, there's quite a lot of papers on how to implement properly secure > capabilities. The only performant way to do it is to implement them > in kernel space or with hardware support. As soon as you pass them > to userspace the user can manipulate them, and doing a cheap enough > verification is non-trivial (e.g. it doesn't buy you anything if you > spent the time you previously spent for lookup roundtrip latency > for some expensive cryptography) I agree that if the cryptographic verification took longer than the N namespace traversals and permission checking that would occur in the other case, that this would be a silly proposal. honestly that didn't occur to me as even remotely possible, especially given that in most cases the server will be verifying the exact same handle lots of times, rather than needing to verify a large number of different handles simultaneously (in which case there's no point in using openg()/openfh()). > Objects without defined lifetime rules are not something we're very keen > on. Particularly in userspace interface they will cause all kinds of > trouble because people will expect the lifetime rules they get from their > normal filesystems. I agree that not being able to clearly define the lifetime of the handle is suboptimal. If the handle is a capability, then its lifetime would be bounded only by potential revocations of the capability, the same way an open FD might then suddenly cease to be valid. On the other hand, in Andreas' "open file handle" implementation the handle might have a shorter lifetime. We're attempting to allow for the underlying FS to implement this in the most natural way for that file system. Those mechanisms lead to different lifetimes. This would bother me quite a bit *if* it complicated the use model, but it really doesn't, particularly because less savvy users are likely to just continue to use open() anyway. >> The documentation of the calls is complicated by the way POSIX calls are >> described. We need to have a second document describing use cases also >> available, so that we can avoid misunderstandings as best we can, get >> straight to the real issues. Sorry that document wasn't available. > > The real problem is that you want to do something in a POSIX spec that > is fundamentally out of scope. POSIX .1 deals with system interfaces > on a single system. You want to specify semantics over multiple systems > in a cluster. I agree; the real problem is that POSIX .1 is being used to specify semantics over multiple systems in a cluster. But we're stuck with that. Thanks, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-12-05 21:44 ` Rob Ross
  2006-12-06 11:01 ` openg Christoph Hellwig
@ 2006-12-06 23:25 ` Latchesar Ionkov
  1 sibling, 0 replies; 124+ messages in thread
From: Latchesar Ionkov @ 2006-12-06 23:25 UTC (permalink / raw)
To: Rob Ross; +Cc: Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
> I unfortunately don't have data to show exactly where the time was
> spent, but it's a good guess that it is all the network traffic in the
> open() case.

Is it hard to repeat the test and check what requests the PVFS server receives (and how much time they take)?

> Well you've caught me. I don't want to cache the values, because I
> fundamentally believe that sharing state between clients and servers is
> braindead (to use Christoph's phrase) in systems of this scale
> (thousands to tens of thousands of clients). So I don't want locks, so I
> can't keep the cache consistent, ...

Having file handles in the server looks like a cache to me :) What are the properties of a cache that it lacks?

Thanks,
Lucho

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Re: NFSv4/pNFS possible POSIX I/O API standards 2006-12-05 16:47 ` Latchesar Ionkov 2006-12-05 17:01 ` Matthew Wilcox 2006-12-05 21:44 ` Rob Ross @ 2006-12-06 9:48 ` David Chinner 2006-12-06 15:53 ` openg and path_to_handle Rob Ross 2 siblings, 1 reply; 124+ messages in thread From: David Chinner @ 2006-12-06 9:48 UTC (permalink / raw) To: Latchesar Ionkov Cc: Rob Ross, Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: > On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote: > >Hi, > > > >I agree that it is not feasible to add new system calls every time > >somebody has a problem, and we don't take adding system calls lightly. > >However, in this case we're talking about an entire *community* of > >people (high-end computing), not just one or two people. Of course it > >may still be the case that that community is not important enough to > >justify the addition of system calls; that's obviously not my call to make! > > I have the feeling that openg stuff is rushed without looking into all > solutions, that don't require changes to the current interface. I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle For the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 124+ messages in thread
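For reference, a minimal user of the libhandle calls Dave points to (link with -lhandle from xfsprogs). Note that, as comes up below, open_by_handle() is currently restricted to CAP_SYS_ADMIN, so this has to run as root; the path is just an example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <xfs/handle.h>   /* xfsprogs libhandle: path_to_handle(), open_by_handle() */

    int main(void)
    {
        void   *hanp = NULL;
        size_t  hlen = 0;

        /* Name resolution and handle construction happen once here... */
        if (path_to_handle("/mnt/xfs/some/file", &hanp, &hlen) < 0) {
            perror("path_to_handle");
            return 1;
        }

        /* ...and the opaque handle can later be turned into an fd
         * without walking the path again (root-only at present). */
        int fd = open_by_handle(hanp, hlen, O_RDONLY);
        if (fd < 0) {
            perror("open_by_handle");
            free_handle(hanp, hlen);
            return 1;
        }

        close(fd);
        free_handle(hanp, hlen);
        return 0;
    }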
* openg and path_to_handle 2006-12-06 9:48 ` David Chinner @ 2006-12-06 15:53 ` Rob Ross 2006-12-06 16:04 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 124+ messages in thread From: Rob Ross @ 2006-12-06 15:53 UTC (permalink / raw) To: David Chinner Cc: Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel David Chinner wrote: > On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote: >> On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote: >>> Hi, >>> >>> I agree that it is not feasible to add new system calls every time >>> somebody has a problem, and we don't take adding system calls lightly. >>> However, in this case we're talking about an entire *community* of >>> people (high-end computing), not just one or two people. Of course it >>> may still be the case that that community is not important enough to >>> justify the addition of system calls; that's obviously not my call to make! >> I have the feeling that openg stuff is rushed without looking into all >> solutions, that don't require changes to the current interface. > > I also get the feeling that interfaces that already do this > open-by-handle stuff haven't been explored either. > > Does anyone here know about the XFS libhandle API? This has been > around for years and it does _exactly_ what these proposed syscalls > are supposed to do (and more). > > See: > > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle > > For the libhandle man page. Basically: > > openg == path_to_handle > sutoc == open_by_handle > > And here for the userspace code: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ > > Cheers, > > Dave. Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Regards, Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle 2006-12-06 15:53 ` openg and path_to_handle Rob Ross @ 2006-12-06 16:04 ` Matthew Wilcox 2006-12-06 16:20 ` Rob Ross 2006-12-06 20:40 ` David Chinner 2006-12-06 23:19 ` Latchesar Ionkov 2 siblings, 1 reply; 124+ messages in thread From: Matthew Wilcox @ 2006-12-06 16:04 UTC (permalink / raw) To: Rob Ross Cc: David Chinner, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > David Chinner wrote: > >Does anyone here know about the XFS libhandle API? This has been > >around for years and it does _exactly_ what these proposed syscalls > >are supposed to do (and more). > > Thanks for pointing these out Dave. These are indeed along the same > lines as the openg()/openfh() approach. > > One difference is that they appear to perform permission checking on the > open_by_handle(), which means that the entire path needs to be encoded > in the handle, and makes it difficult to eliminate the path traversal > overhead on N open_by_handle() operations. Another (and highly important) difference is that usage is restricted to root: xfs_open_by_handle(...) ... if (!capable(CAP_SYS_ADMIN)) return -XFS_ERROR(EPERM); ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle 2006-12-06 16:04 ` Matthew Wilcox @ 2006-12-06 16:20 ` Rob Ross 2006-12-06 20:57 ` David Chinner 0 siblings, 1 reply; 124+ messages in thread From: Rob Ross @ 2006-12-06 16:20 UTC (permalink / raw) To: Matthew Wilcox Cc: David Chinner, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel Matthew Wilcox wrote: > On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: >> David Chinner wrote: >>> Does anyone here know about the XFS libhandle API? This has been >>> around for years and it does _exactly_ what these proposed syscalls >>> are supposed to do (and more). >> Thanks for pointing these out Dave. These are indeed along the same >> lines as the openg()/openfh() approach. >> >> One difference is that they appear to perform permission checking on the >> open_by_handle(), which means that the entire path needs to be encoded >> in the handle, and makes it difficult to eliminate the path traversal >> overhead on N open_by_handle() operations. > > Another (and highly important) difference is that usage is restricted to > root: > > xfs_open_by_handle(...) > ... > if (!capable(CAP_SYS_ADMIN)) > return -XFS_ERROR(EPERM); I assume that this is because the implementation chose not to do the path encoding in the handle? Because if they did, they could do full path permission checking as part of the open_by_handle. Rob ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle 2006-12-06 16:20 ` Rob Ross @ 2006-12-06 20:57 ` David Chinner 0 siblings, 0 replies; 124+ messages in thread From: David Chinner @ 2006-12-06 20:57 UTC (permalink / raw) To: Rob Ross Cc: Matthew Wilcox, David Chinner, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 10:20:23AM -0600, Rob Ross wrote: > Matthew Wilcox wrote: > >On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote: > >>David Chinner wrote: > >>>Does anyone here know about the XFS libhandle API? This has been > >>>around for years and it does _exactly_ what these proposed syscalls > >>>are supposed to do (and more). > >>Thanks for pointing these out Dave. These are indeed along the same > >>lines as the openg()/openfh() approach. > >> > >>One difference is that they appear to perform permission checking on the > >>open_by_handle(), which means that the entire path needs to be encoded > >>in the handle, and makes it difficult to eliminate the path traversal > >>overhead on N open_by_handle() operations. > > > >Another (and highly important) difference is that usage is restricted to > >root: > > > >xfs_open_by_handle(...) > >... > > if (!capable(CAP_SYS_ADMIN)) > > return -XFS_ERROR(EPERM); > > I assume that this is because the implementation chose not to do the > path encoding in the handle? Because if they did, they could do full > path permission checking as part of the open_by_handle. The original use of this interface (if I understand the Irix history correctly - this is way before my time at SGI) was a userspace NFS server and so permission checks were done after the filehandle was opened and a stat could be done on the fd and mode/uid/gid could be compared to what was in the NFS request. Paths were never needed for this because everything needed could be obtained directly from the inode. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 15:53 ` openg and path_to_handle Rob Ross
  2006-12-06 16:04 ` Matthew Wilcox
@ 2006-12-06 20:40 ` David Chinner
  2006-12-06 20:50 ` Matthew Wilcox
  2006-12-06 20:50 ` Rob Ross
  1 sibling, 2 replies; 124+ messages in thread
From: David Chinner @ 2006-12-06 20:40 UTC (permalink / raw)
To: Rob Ross
Cc: David Chinner, Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox, Gary Grider, linux-fsdevel

On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote:
> David Chinner wrote:
> >On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
> >>On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
> >>>Hi,
> >>>
> >>>I agree that it is not feasible to add new system calls every time
> >>>somebody has a problem, and we don't take adding system calls lightly.
> >>>However, in this case we're talking about an entire *community* of people
> >>>(high-end computing), not just one or two people. Of course it may still
> >>>be the case that that community is not important enough to justify the
> >>>addition of system calls; that's obviously not my call to make!
> >>I have the feeling that openg stuff is rushed without looking into all
> >>solutions, that don't require changes to the current interface.
> >
> >I also get the feeling that interfaces that already do this open-by-handle
> >stuff haven't been explored either.
> >
> >Does anyone here know about the XFS libhandle API? This has been around for
> >years and it does _exactly_ what these proposed syscalls are supposed to do
> >(and more).
> >
> >See:
> >
> >http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle
> >
> >For the libhandle man page. Basically:
> >
> >openg == path_to_handle
> >sutoc == open_by_handle
> >
> >And here for the userspace code:
> >
> >http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/
> >
> >Cheers,
> >
> >Dave.
>
> Thanks for pointing these out Dave. These are indeed along the same lines as
> the openg()/openfh() approach.
>
> One difference is that they appear to perform permission checking on the
> open_by_handle(), which means that the entire path needs to be encoded in
> the handle, and makes it difficult to eliminate the path traversal overhead
> on N open_by_handle() operations.

open_by_handle() is checking the inode flags for things like immutability and whether the inode is writable, to determine if the open mode is valid given these flags. It's not actually checking permissions. IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open.

Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue.

This is an implementation detail - it is possible that the file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle 2006-12-06 20:40 ` David Chinner @ 2006-12-06 20:50 ` Matthew Wilcox 2006-12-06 21:09 ` David Chinner 2006-12-06 22:09 ` Andreas Dilger 2006-12-06 20:50 ` Rob Ross 1 sibling, 2 replies; 124+ messages in thread From: Matthew Wilcox @ 2006-12-06 20:50 UTC (permalink / raw) To: David Chinner Cc: Rob Ross, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > Permission checks are done on the path_to_handle(), so in reality > only root or CAP_SYS_ADMIN users can currently use the > open_by_handle interface because of this lack of checking. Given > that our current users of this interface need root permissions to do > other things (data migration), this has never been an issue. > > This is an implementation detail - it is possible that file handle, > being opaque, could encode a UID/GID of the user that constructed > the handle and then allow any process with the same UID/GID to use > open_by_handle() on that handle. (I think hch has already pointed > this out.) While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret. ^ permalink raw reply [flat|nested] 124+ messages in thread
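A sketch of the construction Matthew is describing: the server appends a keyed hash of the handle payload computed with a boot-time secret and refuses any handle whose tag does not verify. The hmac() function below is a placeholder for whatever keyed hash would actually be used; the struct layout is likewise an assumption, meant only to show where the extra bits and the verification cost go.

    /* Illustrative only; hmac() stands in for a real keyed hash such as HMAC-SHA256. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TAG_LEN 32

    struct wire_handle {
        uint64_t fileid;        /* what the server needs to locate the object */
        uint32_t uid, gid;      /* who the handle was issued to               */
        uint64_t issue_time;    /* lets the server expire old handles         */
        uint8_t  tag[TAG_LEN];  /* keyed hash over the fields above           */
    };

    /* Placeholder: assumed to write a keyed hash of buf into out[TAG_LEN]. */
    void hmac(const uint8_t *key, size_t keylen,
              const void *buf, size_t buflen, uint8_t out[TAG_LEN]);

    static uint8_t boot_nonce[32];   /* generated once at server start, kept secret */

    int handle_verify(const struct wire_handle *h)
    {
        uint8_t expect[TAG_LEN];

        /* Recompute the tag over everything except the tag itself. */
        hmac(boot_nonce, sizeof(boot_nonce),
             h, offsetof(struct wire_handle, tag), expect);

        /* A constant-time compare would be preferable; memcmp keeps the sketch short. */
        return memcmp(expect, h->tag, TAG_LEN) == 0;
    }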
* Re: openg and path_to_handle 2006-12-06 20:50 ` Matthew Wilcox @ 2006-12-06 21:09 ` David Chinner 2006-12-06 22:09 ` Andreas Dilger 1 sibling, 0 replies; 124+ messages in thread From: David Chinner @ 2006-12-06 21:09 UTC (permalink / raw) To: Matthew Wilcox Cc: David Chinner, Rob Ross, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel On Wed, Dec 06, 2006 at 01:50:24PM -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > Permission checks are done on the path_to_handle(), so in reality > > only root or CAP_SYS_ADMIN users can currently use the > > open_by_handle interface because of this lack of checking. Given > > that our current users of this interface need root permissions to do > > other things (data migration), this has never been an issue. > > > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. > > I suspect any real answer to this would have to involve cryptographical > techniques (say, creating a secure hash of the information plus a > boot-time generated nonce). Now you're starting to use a lot of bits, > and compute time, and you'll need to be sure to keep the nonce secret. An auth header and GSS-API integration would probably be the way to go here if you really care. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: openg and path_to_handle 2006-12-06 20:50 ` Matthew Wilcox 2006-12-06 21:09 ` David Chinner @ 2006-12-06 22:09 ` Andreas Dilger 2006-12-06 22:17 ` Matthew Wilcox 2006-12-06 23:39 ` Christoph Hellwig 1 sibling, 2 replies; 124+ messages in thread From: Andreas Dilger @ 2006-12-06 22:09 UTC (permalink / raw) To: Matthew Wilcox Cc: David Chinner, Rob Ross, Latchesar Ionkov, Christoph Hellwig, Gary Grider, linux-fsdevel On Dec 06, 2006 13:50 -0700, Matthew Wilcox wrote: > On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: > > This is an implementation detail - it is possible that file handle, > > being opaque, could encode a UID/GID of the user that constructed > > the handle and then allow any process with the same UID/GID to use > > open_by_handle() on that handle. (I think hch has already pointed > > this out.) > > While it could do that, I'd be interested to see how you'd construct > the handle such that it's immune to a malicious user tampering with it, > or saving it across a reboot, or constructing one from scratch. If the server has to have processed a real "open" request, say within the preceding 30s, then it would have a handle for openfh() to match against. If the server reboots, or a client tries to construct a new handle from scratch, or even tries to use the handle after the file is closed then the handle would be invalid. It isn't just an encoding for "open-by-inum", but rather a handle that references some just-created open file handle on the server. That the handle might contain the UID/GID is mostly irrelevant - either the process + network is trusted to pass the handle around without snooping, or a malicious client which intercepts the handle can spoof the UID/GID just as easily. Make the handle sufficiently large to avoid guessing and it is "secure enough" until the whole filesystem is using kerberos to avoid any number of other client/user spoofing attacks. Considering that filesystems like GFS and OCFS allow clients DIRECT ACCESS to the block device itself (which no amount of authentication will fix, unless it is in the disks themselves), the risk of passing a file handle around is pretty minimal. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 124+ messages in thread
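To make the scheme above concrete, a rough server-side sketch: each openg-style request records a large random handle together with the issuing credentials and a short expiry, and the later open simply looks the handle up, so a reboot or timeout silently invalidates it and the client falls back to a normal open(). The 30-second window comes from the message above; everything else (names, sizes, the linear table) is an assumption to keep the example short.

    /* Sketch of an in-memory table of recently issued open handles (server side). */
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define HANDLE_BYTES 32      /* large enough that guessing is impractical */
    #define HANDLE_TTL   30      /* seconds, per the suggestion above         */
    #define TABLE_SIZE   4096

    struct issued_handle {
        uint8_t  bytes[HANDLE_BYTES];   /* random, generated when the handle is issued */
        uint32_t uid, gid;              /* credentials checked at issue time           */
        uint64_t fileid;
        time_t   expires;
        int      in_use;
    };

    static struct issued_handle table[TABLE_SIZE];

    /* Returns the fileid if the handle is known, fresh, and owned by uid/gid;
     * returns 0 otherwise, forcing the client back to a regular open(). */
    uint64_t handle_lookup(const uint8_t bytes[HANDLE_BYTES],
                           uint32_t uid, uint32_t gid)
    {
        time_t now = time(NULL);

        for (int i = 0; i < TABLE_SIZE; i++) {
            struct issued_handle *h = &table[i];
            if (!h->in_use)
                continue;
            if (h->expires < now) {     /* the server is free to forget handles */
                h->in_use = 0;
                continue;
            }
            if (h->uid == uid && h->gid == gid &&
                memcmp(h->bytes, bytes, HANDLE_BYTES) == 0)
                return h->fileid;
        }
        return 0;
    }

The issuing side (not shown) would fill in a free slot with freshly generated random bytes and expires = now + HANDLE_TTL.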
* Re: openg and path_to_handle
  2006-12-06 22:09         ` Andreas Dilger
@ 2006-12-06 22:17           ` Matthew Wilcox
  2006-12-06 22:41             ` Andreas Dilger
  2006-12-06 23:39           ` Christoph Hellwig
  1 sibling, 1 reply; 124+ messages in thread
From: Matthew Wilcox @ 2006-12-06 22:17 UTC (permalink / raw)
  To: David Chinner, Rob Ross, Latchesar Ionkov, Christoph Hellwig,
	Gary Grider, linux-fsdevel

On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> Considering that filesystems like GFS and OCFS allow clients DIRECT
> ACCESS to the block device itself (which no amount of authentication
> will fix, unless it is in the disks themselves), the risk of passing a
> file handle around is pretty minimal.

That's either disingenuous, or missing the point.  OCFS/GFS allow the
kernel direct access to the block device.  openg()&sutoc() are about
passing around file handles to untrusted users.

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 22:17           ` Matthew Wilcox
@ 2006-12-06 22:41             ` Andreas Dilger
  0 siblings, 0 replies; 124+ messages in thread
From: Andreas Dilger @ 2006-12-06 22:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Chinner, Rob Ross, Latchesar Ionkov, Christoph Hellwig,
	Gary Grider, linux-fsdevel

On Dec 06, 2006  15:17 -0700, Matthew Wilcox wrote:
> On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> > Considering that filesystems like GFS and OCFS allow clients DIRECT
> > ACCESS to the block device itself (which no amount of authentication
> > will fix, unless it is in the disks themselves), the risk of passing a
> > file handle around is pretty minimal.
>
> That's either disingenuous, or missing the point.  OCFS/GFS allow the
> kernel direct access to the block device.  openg()&sutoc() are about
> passing around file handles to untrusted users.

Consider - in order to intercept the file handle on the network one
would have to be root on a trusted client.  The same is true for direct
block access.  If the network isn't to be trusted or the clients aren't
to be trusted, then in the absence of strong external authentication
like kerberos the whole thing just falls down (i.e. root on any client
can su to an arbitrary UID/GID to access files to avoid root squash,
or could intercept all of the traffic on the network anyways).

With some network filesystems it is at least possible to get strong
authentication and crypto, but with shared block device filesystems
like OCFS/GFS/GPFS they completely rely on the fact that the network
and all of the clients attached thereon are secure.

If the server that did the original file open and generates the unique
per-open file handle can do basic sanity checking (i.e. user doing the
new open is the same, the file handle isn't stale) then that is no
additional security hole.  Similarly, NFS passes file handles to
clients that are also used to get access to the open file without
traversing the whole path each time.  Those file handles are even
(supposed to be) persistent over reboots.

Don't get me wrong - I understand that what I propose is not secure.
I'm just saying it is no LESS secure than a number of other things
which already exist.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 22:09         ` Andreas Dilger
  2006-12-06 22:17           ` Matthew Wilcox
@ 2006-12-06 23:39           ` Christoph Hellwig
  2006-12-14 22:52             ` Rob Ross
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2006-12-06 23:39 UTC (permalink / raw)
  To: Matthew Wilcox, David Chinner, Rob Ross, Latchesar Ionkov,
	Christoph Hellwig, Gary Grider, linux-fsdevel

On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
> > While it could do that, I'd be interested to see how you'd construct
> > the handle such that it's immune to a malicious user tampering with it,
> > or saving it across a reboot, or constructing one from scratch.
>
> If the server has to have processed a real "open" request, say within
> the preceding 30s, then it would have a handle for openfh() to match
> against.  If the server reboots, or a client tries to construct a new
> handle from scratch, or even tries to use the handle after the file is
> closed then the handle would be invalid.
>
> It isn't just an encoding for "open-by-inum", but rather a handle that
> references some just-created open file handle on the server.  That the
> handle might contain the UID/GID is mostly irrelevant - either the
> process + network is trusted to pass the handle around without snooping,
> or a malicious client which intercepts the handle can spoof the UID/GID
> just as easily.  Make the handle sufficiently large to avoid guessing
> and it is "secure enough" until the whole filesystem is using kerberos
> to avoid any number of other client/user spoofing attacks.

That would be fine as long as the file handle were a kernel-level
concept.  The issue here is that they intend to make the whole
filehandle visible to userspace, for example to pass it around via MPI.
As soon as an untrusted user can tamper with the file handle we're in
trouble.

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 23:39           ` Christoph Hellwig
@ 2006-12-14 22:52             ` Rob Ross
  0 siblings, 0 replies; 124+ messages in thread
From: Rob Ross @ 2006-12-14 22:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, David Chinner, Latchesar Ionkov, Gary Grider,
	linux-fsdevel

Christoph Hellwig wrote:
> On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
>>> While it could do that, I'd be interested to see how you'd construct
>>> the handle such that it's immune to a malicious user tampering with it,
>>> or saving it across a reboot, or constructing one from scratch.
>> If the server has to have processed a real "open" request, say within
>> the preceding 30s, then it would have a handle for openfh() to match
>> against.  If the server reboots, or a client tries to construct a new
>> handle from scratch, or even tries to use the handle after the file is
>> closed then the handle would be invalid.
>>
>> It isn't just an encoding for "open-by-inum", but rather a handle that
>> references some just-created open file handle on the server.  That the
>> handle might contain the UID/GID is mostly irrelevant - either the
>> process + network is trusted to pass the handle around without snooping,
>> or a malicious client which intercepts the handle can spoof the UID/GID
>> just as easily.  Make the handle sufficiently large to avoid guessing
>> and it is "secure enough" until the whole filesystem is using kerberos
>> to avoid any number of other client/user spoofing attacks.
>
> That would be fine as long as the file handle were a kernel-level
> concept.  The issue here is that they intend to make the whole
> filehandle visible to userspace, for example to pass it around via MPI.
> As soon as an untrusted user can tamper with the file handle we're in
> trouble.

I guess it could reference some "just-created open file handle" on the
server, if the server tracks that sort of thing.  Or it could be a
capability, as mentioned previously.  So it isn't necessary to tie this
to an open, but I think that would be a reasonable underlying
implementation for a file system that tracks opens.

If clients can survive a server reboot without a remount, then even
this implementation should continue to operate if a server were
rebooted, because the open file context would be reconstructed.  If
capabilities were being employed, we could likewise survive a server
reboot.  But this issue of server reboots isn't that critical -- the
use case has the handle being reused relatively quickly after the
initial openg(), and clients have a clean fallback in the event that
the handle is no longer valid -- just use open().

Visibility of the handle to a user does not imply that the user can
effectively tamper with the handle.  A cryptographically secure one-way
hash of the data, stored in the handle itself, would allow servers to
verify that the handle wasn't tampered with, or that the client just
made up a handle from scratch.  The server managing the metadata for
that file would not need to share its nonce with other servers,
assuming that single servers are responsible for particular files.

Regards,

Rob

^ permalink raw reply	[flat|nested] 124+ messages in thread
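The tamper-evidence argument above can be made concrete with a short sketch: the server keyed-hashes the handle fields with a nonce it never discloses, and re-checks the hash (plus an expiry time and the issuing user) when the handle is presented. The structure layout and the hmac_sha256() helper are placeholders for whatever the filesystem actually uses; none of this is from the posted proposal.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>
	#include <time.h>

	struct openg_handle {
		uint64_t fsid;
		uint64_t ino;
		uint32_t gen;		/* inode generation number */
		uint32_t uid;		/* uid that passed the openg() permission check */
		uint64_t expires;	/* absolute expiry, seconds since the epoch */
		uint8_t  mac[32];	/* keyed hash over all fields above */
	};

	/* Kept on the metadata server only; regenerated at boot. */
	static uint8_t server_nonce[32];

	/* Placeholder for a real keyed hash, e.g. HMAC-SHA256. */
	void hmac_sha256(const uint8_t *key, size_t keylen,
			 const void *msg, size_t msglen, uint8_t out[32]);

	/* Called when openg() succeeds: seal the handle before returning it. */
	static void handle_seal(struct openg_handle *h)
	{
		h->expires = (uint64_t)time(NULL) + 30;
		hmac_sha256(server_nonce, sizeof(server_nonce),
			    h, offsetof(struct openg_handle, mac), h->mac);
	}

	/* Called when openfh() presents a handle; 0 means it is acceptable. */
	static int handle_verify(const struct openg_handle *h, uint32_t caller_uid)
	{
		uint8_t mac[32];

		hmac_sha256(server_nonce, sizeof(server_nonce),
			    h, offsetof(struct openg_handle, mac), mac);
		if (memcmp(mac, h->mac, sizeof(mac)) != 0)
			return -1;	/* tampered with or fabricated */
		if ((uint64_t)time(NULL) > h->expires)
			return -1;	/* stale, or survived a reboot */
		if (h->uid != caller_uid)
			return -1;	/* different user than the openg() caller */
		return 0;
	}

Because the nonce never leaves the server that issued the handle, no other server needs it, which matches the point about single servers being responsible for particular files.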
* Re: openg and path_to_handle
  2006-12-06 20:40       ` David Chinner
  2006-12-06 20:50         ` Matthew Wilcox
@ 2006-12-06 20:50         ` Rob Ross
  2006-12-06 21:01           ` David Chinner
  1 sibling, 1 reply; 124+ messages in thread
From: Rob Ross @ 2006-12-06 20:50 UTC (permalink / raw)
  To: David Chinner
  Cc: Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox, Gary Grider,
	linux-fsdevel

David Chinner wrote:
> On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote:
>> David Chinner wrote:
>>> On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
>>>> On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
>>>>> Hi,
>>>>>
>>>>> I agree that it is not feasible to add new system calls every time
>>>>> somebody has a problem, and we don't take adding system calls lightly.
>>>>> However, in this case we're talking about an entire *community* of people
>>>>> (high-end computing), not just one or two people. Of course it may still
>>>>> be the case that that community is not important enough to justify the
>>>>> addition of system calls; that's obviously not my call to make!
>>>> I have the feeling that openg stuff is rushed without looking into all
>>>> solutions, that don't require changes to the current interface.
>>> I also get the feeling that interfaces that already do this open-by-handle
>>> stuff haven't been explored either.
>>>
>>> Does anyone here know about the XFS libhandle API? This has been around for
>>> years and it does _exactly_ what these proposed syscalls are supposed to do
>>> (and more).
>>>
>>> See:
>>>
>>> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle
>>>
>>> For the libhandle man page. Basically:
>>>
>>> openg == path_to_handle
>>> sutoc == open_by_handle
>>>
>>> And here for the userspace code:
>>>
>>> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/
>>>
>>> Cheers,
>>>
>>> Dave.
>> Thanks for pointing these out Dave. These are indeed along the same lines as
>> the openg()/openfh() approach.
>>
>> One difference is that they appear to perform permission checking on the
>> open_by_handle(), which means that the entire path needs to be encoded in
>> the handle, and makes it difficult to eliminate the path traversal overhead
>> on N open_by_handle() operations.
>
> open_by_handle() is checking the inode flags for things like
> immutibility and whether the inode is writable to determine if the
> open mode is valid given these flags. It's not actually checking
> permissions. IOWs, open_by_handle() has the same overhead as NFS
> filehandle to inode translation; i.e. no path traversal on open.
>
> Permission checks are done on the path_to_handle(), so in reality
> only root or CAP_SYS_ADMIN users can currently use the
> open_by_handle interface because of this lack of checking. Given
> that our current users of this interface need root permissions to do
> other things (data migration), this has never been an issue.
>
> This is an implementation detail - it is possible that file handle,
> being opaque, could encode a UID/GID of the user that constructed
> the handle and then allow any process with the same UID/GID to use
> open_by_handle() on that handle. (I think hch has already pointed
> this out.)
>
> Cheers,
>
> Dave.

Thanks for the clarification Dave. So I take it that you would be
interested in this type of functionality then?

Regards,

Rob

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 20:50         ` Rob Ross
@ 2006-12-06 21:01           ` David Chinner
  0 siblings, 0 replies; 124+ messages in thread
From: David Chinner @ 2006-12-06 21:01 UTC (permalink / raw)
  To: Rob Ross
  Cc: David Chinner, Latchesar Ionkov, Christoph Hellwig, Matthew Wilcox,
	Gary Grider, linux-fsdevel

On Wed, Dec 06, 2006 at 02:50:49PM -0600, Rob Ross wrote:
> David Chinner wrote:
> > On Wed, Dec 06, 2006 at 09:53:39AM -0600, Rob Ross wrote:
> >> David Chinner wrote:
> >>> Does anyone here know about the XFS libhandle API? This has been
> >>> around for years and it does _exactly_ what these proposed syscalls
> >>> are supposed to do (and more).
> >> Thanks for pointing these out Dave. These are indeed along the same
> >> lines as the openg()/openfh() approach.
> >>
> >> One difference is that they appear to perform permission checking on the
> >> open_by_handle(), which means that the entire path needs to be encoded in
> >> the handle, and makes it difficult to eliminate the path traversal
> >> overhead on N open_by_handle() operations.
> >
> > open_by_handle() is checking the inode flags for things like
> > immutibility and whether the inode is writable to determine if the
> > open mode is valid given these flags. It's not actually checking
> > permissions. IOWs, open_by_handle() has the same overhead as NFS
> > filehandle to inode translation; i.e. no path traversal on open.
> >
> > Permission checks are done on the path_to_handle(), so in reality
> > only root or CAP_SYS_ADMIN users can currently use the
> > open_by_handle interface because of this lack of checking. Given
> > that our current users of this interface need root permissions to do
> > other things (data migration), this has never been an issue.
> >
> > This is an implementation detail - it is possible that file handle,
> > being opaque, could encode a UID/GID of the user that constructed
> > the handle and then allow any process with the same UID/GID to use
> > open_by_handle() on that handle. (I think hch has already pointed
> > this out.)
>
> Thanks for the clarification Dave. So I take it that you would be
> interested in this type of functionality then?

Not really - just trying to help by pointing out something no-one
seemed to know about....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 124+ messages in thread
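For anyone who wants to experiment with the interface Dave is pointing at, a minimal sketch of how the xfsprogs libhandle calls fit together is shown below. It is based on the library's documented interface (exact prototypes may differ between xfsprogs versions), builds with -lhandle, and, as noted above, open_by_handle() currently requires CAP_SYS_ADMIN in the opening process.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <xfs/handle.h>		/* from xfsprogs; link with -lhandle */

	int main(int argc, char **argv)
	{
		void *hanp;
		size_t hlen;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <path-on-xfs>\n", argv[0]);
			return 1;
		}

		/* Resolve the path once; this is where permissions are checked. */
		if (path_to_handle(argv[1], &hanp, &hlen) < 0) {
			perror("path_to_handle");
			return 1;
		}

		/*
		 * The (hanp, hlen) blob could now be handed to other processes.
		 * Opening by handle skips path traversal entirely, but needs
		 * CAP_SYS_ADMIN in the process that calls it.
		 */
		fd = open_by_handle(hanp, hlen, O_RDONLY);
		if (fd < 0) {
			perror("open_by_handle");
			free_handle(hanp, hlen);
			return 1;
		}

		/* ... read from fd as usual ... */
		close(fd);
		free_handle(hanp, hlen);
		return 0;
	}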
* Re: openg and path_to_handle
  2006-12-06 15:53     ` openg and path_to_handle Rob Ross
  2006-12-06 16:04       ` Matthew Wilcox
  2006-12-06 20:40       ` David Chinner
@ 2006-12-06 23:19       ` Latchesar Ionkov
  2006-12-14 21:00         ` Rob Ross
  2 siblings, 1 reply; 124+ messages in thread
From: Latchesar Ionkov @ 2006-12-06 23:19 UTC (permalink / raw)
  To: Rob Ross
  Cc: David Chinner, Christoph Hellwig, Matthew Wilcox, Gary Grider,
	linux-fsdevel

On 12/6/06, Rob Ross <rross@mcs.anl.gov> wrote:
> David Chinner wrote:
> > On Tue, Dec 05, 2006 at 05:47:16PM +0100, Latchesar Ionkov wrote:
> >> On 12/5/06, Rob Ross <rross@mcs.anl.gov> wrote:
> >>> Hi,
> >>>
> >>> I agree that it is not feasible to add new system calls every time
> >>> somebody has a problem, and we don't take adding system calls lightly.
> >>> However, in this case we're talking about an entire *community* of
> >>> people (high-end computing), not just one or two people. Of course it
> >>> may still be the case that that community is not important enough to
> >>> justify the addition of system calls; that's obviously not my call to make!
> >> I have the feeling that openg stuff is rushed without looking into all
> >> solutions, that don't require changes to the current interface.
> >
> > I also get the feeling that interfaces that already do this
> > open-by-handle stuff haven't been explored either.
> >
> > Does anyone here know about the XFS libhandle API? This has been
> > around for years and it does _exactly_ what these proposed syscalls
> > are supposed to do (and more).
> >
> > See:
> >
> > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle
> >
> > For the libhandle man page. Basically:
> >
> > openg == path_to_handle
> > sutoc == open_by_handle
> >
> > And here for the userspace code:
> >
> > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/
> >
> > Cheers,
> >
> > Dave.
>
> Thanks for pointing these out Dave. These are indeed along the same
> lines as the openg()/openfh() approach.

The open-by-handle makes a little more sense, because the "handle" is
not opened, it only points to a resolved file. As I mentioned before,
it doesn't make much sense to bundle name resolution and file open
together in openg. Still I am not convinced that we need two ways of
"finding" files.

Thanks,

Lucho

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: openg and path_to_handle
  2006-12-06 23:19       ` Latchesar Ionkov
@ 2006-12-14 21:00         ` Rob Ross
  2006-12-14 21:20           ` Matthew Wilcox
  0 siblings, 1 reply; 124+ messages in thread
From: Rob Ross @ 2006-12-14 21:00 UTC (permalink / raw)
  To: Latchesar Ionkov
  Cc: David Chinner, Christoph Hellwig, Matthew Wilcox, Gary Grider,
	linux-fsdevel

Latchesar Ionkov wrote:
> On 12/6/06, Rob Ross <rross@mcs.anl.gov> wrote:
>> David Chinner wrote:
>> >
>> > I also get the feeling that interfaces that already do this
>> > open-by-handle stuff haven't been explored either.
>> >
>> > Does anyone here know about the XFS libhandle API? This has been
>> > around for years and it does _exactly_ what these proposed syscalls
>> > are supposed to do (and more).
>> >
>> > See:
>> >
>> > http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man3/open_by_handle.3.html&srch=open_by_handle
>> >
>> > For the libhandle man page. Basically:
>> >
>> > openg == path_to_handle
>> > sutoc == open_by_handle
>> >
>> > And here for the userspace code:
>> >
>> > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/
>> >
>> > Cheers,
>> >
>> > Dave.
>>
>> Thanks for pointing these out Dave. These are indeed along the same
>> lines as the openg()/openfh() approach.
>
> The open-by-handle makes a little more sense, because the "handle" is
> not opened, it only points to a resolved file. As I mentioned before,
> it doesn't make much sense to bundle name resolution and file open
> together in openg.

I don't think that I understand what you're saying here. The openg()
call does not perform file open (not that that is necessarily even a
first-class FS operation), it simply does the lookup.

When we were naming these calls, from a POSIX consistency perspective
it seemed best to keep the "open" nomenclature. That seems to be
confusing to some. Perhaps we should rename the function "lookup" or
something similar, to help keep from giving the wrong idea?

There is a difference between the openg() and path_to_handle()
approach in that we do permission checking at openg(), and that does
have implications on how the handle might be stored and such. That's
being discussed in a separate thread.

Thanks,

Rob

^ permalink raw reply	[flat|nested] 124+ messages in thread
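To make the intended division of labour concrete, here is roughly how the proposed pair would be used from an MPI job: one rank pays for the lookup and permission check, and every other rank converts the broadcast handle into its own file descriptor. The signatures of openg() and openfh() below are only a plausible reading of the proposal (the actual draft man pages are on the OpenGroup site referenced at the top of the thread), and the fixed handle size is an assumption.

	#include <fcntl.h>
	#include <mpi.h>
	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Hypothetical prototypes for the proposed calls, for illustration:
	 *   openg()  - resolve a path and check permissions, producing an
	 *              opaque handle that can be shipped to other processes
	 *   openfh() - turn such a handle into an ordinary file descriptor
	 */
	#define FH_SIZE 128
	int openg(const char *path, int flags, void *handle, size_t *handle_len);
	int openfh(const void *handle, size_t handle_len);

	int open_on_all_ranks(const char *path, MPI_Comm comm)
	{
		uint8_t handle[FH_SIZE];
		int rank, err = 0;

		MPI_Comm_rank(comm, &rank);

		if (rank == 0) {
			size_t len = sizeof(handle);
			/* One path traversal + permission check for the whole job. */
			if (openg(path, O_RDONLY, handle, &len) < 0)
				err = 1;
		}
		MPI_Bcast(&err, 1, MPI_INT, 0, comm);
		if (err)
			return -1;	/* callers fall back to plain open(path) */
		MPI_Bcast(handle, FH_SIZE, MPI_BYTE, 0, comm);

		/* Every rank (including 0) gets its own descriptor from the handle. */
		return openfh(handle, FH_SIZE);
	}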
* Re: openg and path_to_handle
  2006-12-14 21:00         ` Rob Ross
@ 2006-12-14 21:20           ` Matthew Wilcox
  2006-12-14 23:02             ` Rob Ross
  0 siblings, 1 reply; 124+ messages in thread
From: Matthew Wilcox @ 2006-12-14 21:20 UTC (permalink / raw)
  To: Rob Ross
  Cc: Latchesar Ionkov, David Chinner, Christoph Hellwig, Gary Grider,
	linux-fsdevel

On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
> I don't think that I understand what you're saying here. The openg()
> call does not perform file open (not that that is necessarily even a
> first-class FS operation), it simply does the lookup.
>
> When we were naming these calls, from a POSIX consistency perspective it
> seemed best to keep the "open" nomenclature. That seems to be confusing
> to some. Perhaps we should rename the function "lookup" or something
> similar, to help keep from giving the wrong idea?
>
> There is a difference between the openg() and path_to_handle() approach
> in that we do permission checking at openg(), and that does have
> implications on how the handle might be stored and such. That's being
> discussed in a separate thread.

I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

	int openg(const char *path)
	{
		char *s;
		do {
			s = tempnam(FSROOT, ".sutoc");
			link(path, s);
		} while (errno == EEXIST);

		mpi_broadcast(s);
		sleep(10);
		unlink(s);
	}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting ".sutoc" specially, and as
efficiently as adding the openg/sutoc concept.

^ permalink raw reply	[flat|nested] 124+ messages in thread
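A slightly more careful rendering of the same idea, for anyone who wants to try it: the only changes from the sketch above are checking link()'s return value rather than bare errno and cleaning up the temporary name. FSROOT and the broadcast routine are still assumed to be supplied by the application, as in the original sketch.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* Assumed to be provided by the application, as in the sketch above. */
	extern const char *FSROOT;
	void mpi_broadcast(const char *name);

	int openg_by_link(const char *path)
	{
		char *s;

		for (;;) {
			s = tempnam(FSROOT, ".sutoc");
			if (!s)
				return -1;
			if (link(path, s) == 0)
				break;			/* got a unique alias */
			free(s);
			if (errno != EEXIST)
				return -1;		/* e.g. EXDEV on this filesystem */
		}

		mpi_broadcast(s);			/* peers simply open(s, ...) */
		sleep(10);				/* crude lifetime, as above */
		unlink(s);
		free(s);
		return 0;
	}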
* Re: openg and path_to_handle
  2006-12-14 21:20           ` Matthew Wilcox
@ 2006-12-14 23:02             ` Rob Ross
  0 siblings, 0 replies; 124+ messages in thread
From: Rob Ross @ 2006-12-14 23:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Latchesar Ionkov, David Chinner, Christoph Hellwig, Gary Grider,
	linux-fsdevel

Matthew Wilcox wrote:
> On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
>> I don't think that I understand what you're saying here. The openg()
>> call does not perform file open (not that that is necessarily even a
>> first-class FS operation), it simply does the lookup.
>>
>> When we were naming these calls, from a POSIX consistency perspective it
>> seemed best to keep the "open" nomenclature. That seems to be confusing
>> to some. Perhaps we should rename the function "lookup" or something
>> similar, to help keep from giving the wrong idea?
>>
>> There is a difference between the openg() and path_to_handle() approach
>> in that we do permission checking at openg(), and that does have
>> implications on how the handle might be stored and such. That's being
>> discussed in a separate thread.
>
> I was just thinking about how one might implement this, when it struck
> me ... how much more efficient is a kernel implementation compared to:
>
> 	int openg(const char *path)
> 	{
> 		char *s;
> 		do {
> 			s = tempnam(FSROOT, ".sutoc");
> 			link(path, s);
> 		} while (errno == EEXIST);
>
> 		mpi_broadcast(s);
> 		sleep(10);
> 		unlink(s);
> 	}
>
> and sutoc() becomes simply open().  Now you have a name that's quick to
> open (if a client has the filesystem mounted, it has a handle for the
> root already), has a defined lifespan, has minimal permission checking,
> and doesn't require standardisation.
>
> I suppose some cluster fs' might not support cross-directory links
> (AFS is one, I think), but then, no cluster fs's support openg/sutoc.

Well at least one does :).

> If a filesystem's willing to add support for these handles, it shouldn't
> be too hard for them to treat files starting ".sutoc" specially, and as
> efficiently as adding the openg/sutoc concept.

Adding atomic reference count updating on file metadata so that we can
have cross-directory links is not necessarily easier than supporting
openg/openfh, and supporting cross-directory links precludes certain
metadata organizations, such as the ones being used in Ceph (as I
understand it).

This also still forces all clients to read a directory and for N
permission checking operations to be performed. I don't see what the FS
could do to eliminate those operations given what you've described. Am
I missing something?

Also this looks too much like sillyrename, and that's hard to swallow...

Regards,

Rob

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: NFSv4/pNFS possible POSIX I/O API standards
  2006-11-28  4:34 NFSv4/pNFS possible POSIX I/O API standards Gary Grider
  2006-11-28  5:54 ` Christoph Hellwig
@ 2006-11-28 15:08 ` Matthew Wilcox
  1 sibling, 0 replies; 124+ messages in thread
From: Matthew Wilcox @ 2006-11-28 15:08 UTC (permalink / raw)
  To: Gary Grider; +Cc: linux-fsdevel

On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote:
> >Things like
> >openg() - on process opens a file and gets a key that is passed to
> >lots of processes which
> >use the key to get a handle (great for thousands of processes opening a
> >file)

I don't understand how this leads to a more efficient implementation.
It seems to just add complexity.  What does 'sutoc' mean anyway?

> >readx/writex - scattergather readwrite - more appropriate and
> >complete than the real time extended read/write

These don't seem to be documented on the website.

^ permalink raw reply	[flat|nested] 124+ messages in thread
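For comparison, the scatter/gather interface POSIX already has — which readx()/writex() would presumably extend (the draft man pages are on the OpenGroup site mentioned at the start of the thread, not reproduced here) — is readv()/writev(): several memory buffers, one file position. A minimal reminder of its shape:

	#include <stdio.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* Write a header and a body in a single syscall with writev(). */
	int write_header_and_body(int fd)
	{
		const char header[] = "HDR1";
		const char body[]   = "payload bytes";
		struct iovec iov[2];

		iov[0].iov_base = (void *)header;
		iov[0].iov_len  = sizeof(header) - 1;
		iov[1].iov_base = (void *)body;
		iov[1].iov_len  = sizeof(body) - 1;

		/* Two buffers, written contiguously at the current file position. */
		if (writev(fd, iov, 2) < 0) {
			perror("writev");
			return -1;
		}
		return 0;
	}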
Thread overview: 124+ messages
2006-11-28 4:34 NFSv4/pNFS possible POSIX I/O API standards Gary Grider
2006-11-28 5:54 ` Christoph Hellwig
2006-11-28 10:54 ` Andreas Dilger
2006-11-28 11:28 ` Anton Altaparmakov
2006-11-28 20:17 ` Russell Cattelan
2006-11-28 23:28 ` Wendy Cheng
2006-11-29 9:12 ` Christoph Hellwig
2006-11-29 9:04 ` Christoph Hellwig
2006-11-29 9:14 ` Christoph Hellwig
2006-11-29 9:48 ` Andreas Dilger
2006-11-29 10:18 ` Anton Altaparmakov
2006-11-29 8:26 ` Brad Boyer
2006-11-30 9:25 ` Christoph Hellwig
2006-11-30 17:49 ` Sage Weil
2006-12-01 5:26 ` Trond Myklebust
2006-12-01 7:08 ` Sage Weil
2006-12-01 14:41 ` Trond Myklebust
2006-12-01 16:47 ` Sage Weil
2006-12-01 18:07 ` Trond Myklebust
2006-12-01 18:42 ` Sage Weil
2006-12-01 19:13 ` Trond Myklebust
2006-12-01 20:32 ` Sage Weil
2006-12-04 18:02 ` Peter Staubach
2006-12-05 23:20 ` readdirplus() as possible POSIX I/O API Sage Weil
2006-12-06 15:48 ` Peter Staubach
2006-12-03 1:57 ` NFSv4/pNFS possible POSIX I/O API standards Andreas Dilger
2006-12-03 7:34 ` Kari Hurtta
2006-12-03 1:52 ` Andreas Dilger
2006-12-03 16:10 ` Sage Weil
2006-12-04 7:32 ` Andreas Dilger
2006-12-04 15:15 ` Trond Myklebust
2006-12-05 0:59 ` Rob Ross
2006-12-05 4:44 ` Gary Grider
2006-12-05 10:05 ` Christoph Hellwig
2006-12-05 5:56 ` Trond Myklebust
2006-12-05 10:07 ` Christoph Hellwig
2006-12-05 14:20 ` Matthew Wilcox
2006-12-06 15:04 ` Rob Ross
2006-12-06 15:44 ` Matthew Wilcox
2006-12-06 16:15 ` Rob Ross
2006-12-05 14:55 ` Trond Myklebust
2006-12-05 22:11 ` Rob Ross
2006-12-05 23:24 ` Trond Myklebust
2006-12-06 16:42 ` Rob Ross
2006-12-06 12:22 ` Ragnar Kjørstad
2006-12-06 15:14 ` Trond Myklebust
2006-12-05 16:55 ` Latchesar Ionkov
2006-12-05 22:12 ` Christoph Hellwig
2006-12-06 23:12 ` Latchesar Ionkov
2006-12-06 23:33 ` Trond Myklebust
2006-12-05 21:50 ` Rob Ross
2006-12-05 22:05 ` Christoph Hellwig
2006-12-05 23:18 ` Sage Weil
2006-12-05 23:55 ` Ulrich Drepper
2006-12-06 10:06 ` Andreas Dilger
2006-12-06 17:19 ` Ulrich Drepper
2006-12-06 17:27 ` Rob Ross
2006-12-06 17:42 ` Ulrich Drepper
2006-12-06 18:01 ` Ragnar Kjørstad
2006-12-06 18:13 ` Ulrich Drepper
2006-12-17 14:41 ` Ragnar Kjørstad
2006-12-17 19:07 ` Ulrich Drepper
2006-12-17 19:38 ` Matthew Wilcox
2006-12-17 21:51 ` Ulrich Drepper
2006-12-18 2:57 ` Ragnar Kjørstad
2006-12-18 3:54 ` Gary Grider
2006-12-07 5:57 ` Andreas Dilger
2006-12-15 22:37 ` Ulrich Drepper
2006-12-16 18:13 ` Andreas Dilger
2006-12-16 19:08 ` Ulrich Drepper
2006-12-14 23:58 ` statlite() Rob Ross
2006-12-07 23:39 ` NFSv4/pNFS possible POSIX I/O API standards Nikita Danilov
2006-12-05 14:37 ` Peter Staubach
2006-12-05 10:26 ` readdirplus() as possible POSIX I/O API Andreas Dilger
2006-12-05 15:23 ` Trond Myklebust
2006-12-06 10:28 ` Andreas Dilger
2006-12-06 15:10 ` Trond Myklebust
2006-12-05 17:06 ` Latchesar Ionkov
2006-12-05 22:48 ` Rob Ross
2006-11-29 10:25 ` NFSv4/pNFS possible POSIX I/O API standards Steven Whitehouse
2006-11-30 12:29 ` Christoph Hellwig
2006-12-01 15:52 ` Ric Wheeler
2006-11-29 12:23 ` Matthew Wilcox
2006-11-29 12:35 ` Matthew Wilcox
2006-11-29 16:26 ` Gary Grider
2006-11-29 17:18 ` Christoph Hellwig
2006-11-29 12:39 ` Christoph Hellwig
2006-12-01 22:29 ` Rob Ross
2006-12-02 2:35 ` Latchesar Ionkov
2006-12-05 0:37 ` Rob Ross
2006-12-05 10:02 ` Christoph Hellwig
2006-12-05 16:47 ` Latchesar Ionkov
2006-12-05 17:01 ` Matthew Wilcox
[not found] ` <f158dc670612050909m366594c5ubaa87d9a9ecc8c2a@mail.gmail.com>
2006-12-05 17:10 ` Latchesar Ionkov
2006-12-05 17:39 ` Matthew Wilcox
2006-12-05 21:55 ` Rob Ross
2006-12-05 21:50 ` Peter Staubach
2006-12-05 21:44 ` Rob Ross
2006-12-06 11:01 ` openg Christoph Hellwig
2006-12-06 15:41 ` openg Trond Myklebust
2006-12-06 15:42 ` openg Rob Ross
2006-12-06 23:32 ` openg Christoph Hellwig
2006-12-14 23:36 ` openg Rob Ross
2006-12-06 23:25 ` Re: NFSv4/pNFS possible POSIX I/O API standards Latchesar Ionkov
2006-12-06 9:48 ` David Chinner
2006-12-06 15:53 ` openg and path_to_handle Rob Ross
2006-12-06 16:04 ` Matthew Wilcox
2006-12-06 16:20 ` Rob Ross
2006-12-06 20:57 ` David Chinner
2006-12-06 20:40 ` David Chinner
2006-12-06 20:50 ` Matthew Wilcox
2006-12-06 21:09 ` David Chinner
2006-12-06 22:09 ` Andreas Dilger
2006-12-06 22:17 ` Matthew Wilcox
2006-12-06 22:41 ` Andreas Dilger
2006-12-06 23:39 ` Christoph Hellwig
2006-12-14 22:52 ` Rob Ross
2006-12-06 20:50 ` Rob Ross
2006-12-06 21:01 ` David Chinner
2006-12-06 23:19 ` Latchesar Ionkov
2006-12-14 21:00 ` Rob Ross
2006-12-14 21:20 ` Matthew Wilcox
2006-12-14 23:02 ` Rob Ross
2006-11-28 15:08 ` NFSv4/pNFS possible POSIX I/O API standards Matthew Wilcox