* NFS4 crack @ 2005-09-18 10:21 Christoph Hellwig 2005-09-18 14:36 ` J. Bruce Fields 2005-09-20 18:37 ` Neil Brown 0 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-18 10:21 UTC (permalink / raw)
To: akpm, neilb, andros, bfields; +Cc: linux-fsdevel

I've recently turned on NFS4 server support accidentally, just to get
error messages like:

"NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"

To my horror I found out that this comes from kernel code, which messes
with a hardcoded directory, completely ignoring any namespace or other
uses issues. The fs handling in fs/nfs/nfs4recovery.c is rather broken
in addition.

All this comes from "[PATCH] knfsd: nfsd4: initialize recovery directory",
commit ID 190e4fbf96037e5e526ba3210f2bcc2a3b6fe964.

Andrew, could you please back this out again, and NFS folks, please
don't do stuff like that and hide your crackpipe somewhere. And please,
we really need someone sane to review NFS patches, I think.

(not cc'ed to the nfs list because of its stupid subscribers only
policy)

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-18 10:21 NFS4 crack Christoph Hellwig @ 2005-09-18 14:36 ` J. Bruce Fields 2005-09-19 10:35 ` Christoph Hellwig 2005-09-20 18:37 ` Neil Brown 1 sibling, 1 reply; 41+ messages in thread From: J. Bruce Fields @ 2005-09-18 14:36 UTC (permalink / raw) To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote: > I've recently turned on NFS4 server support accidentally, just to get > error messages like: > > "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist" > > To my horror I found out that this comes from kernel code, which messes > with a hardcoded directory, completelyu ingoring any namespace or other > uses issues. As long as all nfsd threads are in the same namespace, I don't see any namespace issues. What am I missing? > The fs handling in fs/nfs/nfs4recovery.c is rather broken in addition. For example? --b. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-18 14:36 ` J. Bruce Fields @ 2005-09-19 10:35 ` Christoph Hellwig 2005-09-19 13:04 ` Anton Altaparmakov ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 10:35 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel

On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote:
> On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
> > I've recently turned on NFS4 server support accidentally, just to get
> > error messages like:
> >
> > "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
> >
> > To my horror I found out that this comes from kernel code, which messes
> > with a hardcoded directory, completely ignoring any namespace or other
> > uses issues.
>
> As long as all nfsd threads are in the same namespace, I don't see any
> namespace issues. What am I missing?

By namespace issues above I meant that the kernel can't assume a
namespace at all, not even thinking about multiple namespaces, which
makes it even more wrong. Who says I allow the kernel to mess with
/var/lib/nfs/v4recover? Who says any userspace process is even in the
same namespace as the nfs threads to create the directories? The kernel
assuming any namespace is wrong and we don't do it anywhere.

> > The fs handling in fs/nfs/nfs4recovery.c is rather broken in addition.
>
> For example?

 - opens a directory O_RDWR which open_namei wouldn't even allow
 - tries to build dentry list from vfs_readdir callback, leading to
   deadlocks on filesystems that take the same lock from readdir
   and lookup
 - resets fsuid/fsgids without checks, synchronization or callouts
   into subsystems that care (security, keys, ptrace)
 - looks up /var/lib/nfs/v4recovery without ensuring it's a directory

and probably a few more if one tried to look at it for more than five
minutes. This is code that could be a third of the size if written
in userspace and actually had a chance to be correct there, never mind
the policy violations.

Please remove the code and never ever try to sneak in something like that
again. Thanks.
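
Two of the points in the list above (opening a directory O_RDWR, and using a path without checking that it is a directory) can be reproduced from userspace. What follows is only a minimal sketch, not the nfsd code: the helper names open_recovery_dir() and errno_of_rdwr_open() are hypothetical, and it assumes a POSIX.1-2008 system where O_DIRECTORY, O_NOFOLLOW and O_CLOEXEC are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a state directory read-only and refuse
 * anything that is not a real directory, including a symlink that
 * points at one. Returns a descriptor >= 0, or -1 with errno set.
 */
int open_recovery_dir(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
}

/*
 * Hypothetical helper: try to open `path` read-write and report the
 * errno that open(2) produced, or 0 if the open succeeded. For a
 * directory the expected result is EISDIR.
 */
int errno_of_rdwr_open(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

With O_DIRECTORY the "is it a directory?" check and the open happen in one step, so there is no window between a separate stat() and the open in which the path could be swapped.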
* Re: NFS4 crack 2005-09-19 10:35 ` Christoph Hellwig @ 2005-09-19 13:04 ` Anton Altaparmakov 2005-09-19 13:35 ` J. Bruce Fields 2005-09-19 20:31 ` J. Bruce Fields 2 siblings, 0 replies; 41+ messages in thread From: Anton Altaparmakov @ 2005-09-19 13:04 UTC (permalink / raw) To: Christoph Hellwig; +Cc: J. Bruce Fields, akpm, neilb, andros, linux-fsdevel On Mon, 2005-09-19 at 12:35 +0200, Christoph Hellwig wrote: > On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote: > > On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote: [snip] > > > The fs handling in fs/nfs/nfs4recovery.c is rather broken in addition. > > > > For example? [snip] > - tries to build dentry list from vfs_readdir callback, leading to > deadlocks on filesystems that take the same lock from readdir > and lookup NFSv3 has always done this and yes it did lead to deadlock in ntfs so I had to work around it in ntfs to get it to work. I had to redesign how the locking worked which was really annoying thing to have to do. )-: Just pointing this out as it seems to be commonplace for nfs and nothing new... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 10:35 ` Christoph Hellwig 2005-09-19 13:04 ` Anton Altaparmakov @ 2005-09-19 13:35 ` J. Bruce Fields 2005-09-19 13:39 ` Christoph Hellwig 2005-09-19 20:31 ` J. Bruce Fields 2 siblings, 1 reply; 41+ messages in thread From: J. Bruce Fields @ 2005-09-19 13:35 UTC (permalink / raw) To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote: > Namespaces issues above was meant as kernel can't assume namespace at > all, not even thinking about multiple namespaces which makes it even > more wrong. Who sais I allow the kernel to mess with > /var/lib/nfs/v4recover? It's run-time configurable if you don't like the default. > Who tells any userspace process is even in the same namespace as the > nfs threads to create the directories? No userspace process is likely to care, except maybe for debugging purposes. This isn't a userspace<->kernel interface, it's just a way to store some information on disk so nfsd can find it again on next boot. > Kernel assuming any namespace is wrong and we don't do it anywhere. Well, nfsd does have some assumptions--mountd, exportfs, and nfsd all have to be in the same namespace, for example. (Or at least namespaces that are identical on exported paths.) > > > The fs handling in fs/nfs/nfs4recovery.c is rather broken in addition. > > > > For example? > > - opens a directory O_RDWR which open_namei wouldn't even allow > - tries to build dentry list from vfs_readdir callback, leading to > deadlocks on filesystems that take the same lock from readdir > and lookup > - resets fsuid/fsgids without checks, synchronization or callouts > into subsystems that care (security, keys, ptrace) > - looks up /var/lib/nfs/v4recovery without ensuring it's a directory > > and probably a few more if one tried to look at it for more than five > minutes. Are you sure about readdir? 
It looks to me like nfsd has done lookups there for some time--see, e.g., fs/nfsd/nfs3xdr.c:compose_entry_fh(). But I'll read through it again and check the other stuff you mention, thanks. --b.
* Re: NFS4 crack 2005-09-19 13:35 ` J. Bruce Fields @ 2005-09-19 13:39 ` Christoph Hellwig 2005-09-19 14:07 ` J. Bruce Fields 2005-09-19 17:13 ` Bryan Henderson 0 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 13:39 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel

On Mon, Sep 19, 2005 at 09:35:28AM -0400, J. Bruce Fields wrote:
> On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> > Namespaces issues above was meant as kernel can't assume namespace at
> > all, not even thinking about multiple namespaces which makes it even
> > more wrong. Who says I allow the kernel to mess with
> > /var/lib/nfs/v4recover?
>
> It's run-time configurable if you don't like the default.
>
> > Who tells any userspace process is even in the same namespace as the
> > nfs threads to create the directories?
>
> No userspace process is likely to care, except maybe for debugging
> purposes. This isn't a userspace<->kernel interface, it's just a way to
> store some information on disk so nfsd can find it again on next boot.

Again,

    FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL

if that wasn't clear enough. You can continue enumerating the special
cases in which it could actually work somehow, but that doesn't make the
code any better. We have a strong policy of not having hardcoded
filenames in the kernel (although there are a few weak abstractions where
we pass something very similar to a filename up to userspace to act on
it), and we're not going to make an exception for NFSv4. Especially as
this code would be much simpler in userspace, as already mentioned.
Directory handling is something that can't be done sanely in
kernelspace.
* Re: NFS4 crack 2005-09-19 13:39 ` Christoph Hellwig @ 2005-09-19 14:07 ` J. Bruce Fields 2005-09-19 14:11 ` Christoph Hellwig 2005-09-19 17:13 ` Bryan Henderson 1 sibling, 1 reply; 41+ messages in thread From: J. Bruce Fields @ 2005-09-19 14:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel On Mon, Sep 19, 2005 at 03:39:21PM +0200, Christoph Hellwig wrote: > On Mon, Sep 19, 2005 at 09:35:28AM -0400, J. Bruce Fields wrote: > > On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote: > > > Namespaces issues above was meant as kernel can't assume namespace at > > > all, not even thinking about multiple namespaces which makes it even > > > more wrong. Who sais I allow the kernel to mess with > > > /var/lib/nfs/v4recover? > > > > It's run-time configurable if you don't like the default. > > > > > Who tells any userspace process is even in the same namespace as the > > > nfs threads to create the directories? > > > > No userspace process is likely to care, except maybe for debugging > > purposes. This isn't a userspace<->kernel interface, it's just a way to > > store some information on disk so nfsd can find it again on next boot. > > Again, > > FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL What problem does this create in this case? The "hardcoded" path is just a default for a value that can be modified at runtime. We could default to the empty string, I suppose, and make sure the path is set in the nfs init scripts. --b. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 14:07 ` J. Bruce Fields @ 2005-09-19 14:11 ` Christoph Hellwig 0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 14:11 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel

On Mon, Sep 19, 2005 at 10:07:15AM -0400, J. Bruce Fields wrote:
> > > No userspace process is likely to care, except maybe for debugging
> > > purposes. This isn't a userspace<->kernel interface, it's just a way to
> > > store some information on disk so nfsd can find it again on next boot.
> >
> > Again,
> >
> > FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
>
> What problem does this create in this case?
>
> The "hardcoded" path is just a default for a value that can be modified
> at runtime.

Umm, that's not the point at all. Pathnames are user policy and they
shouldn't be used from the kernel, even if configurable. File access from
kernelspace should be avoided whenever possible. NFSD is an exception as
it needs to access files as part of its job, but that exception doesn't
give it a wildcard to do random crap.

And the other point is that the code is utter crap and could be done
much better in userspace.
* Re: NFS4 crack 2005-09-19 13:39 ` Christoph Hellwig 2005-09-19 14:07 ` J. Bruce Fields @ 2005-09-19 17:13 ` Bryan Henderson 2005-09-19 17:16 ` Randy.Dunlap 2005-09-19 18:02 ` Christoph Hellwig 1 sibling, 2 replies; 41+ messages in thread From: Bryan Henderson @ 2005-09-19 17:13 UTC (permalink / raw) To: Christoph Hellwig Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel, neilb >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL I think that's a great policy, but we can't be all that righteous about it because we don't do it today. I have a system that has highly customized file names, so I'm pretty familiar with all the world's hardcoded file names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and /sbin/modprobe. I could give /sbin/init and /bin/sh a pass because they're involved in bootstrapping, which always breaks a few rules. /sbin/modprobe is part of an application that is in the same boat as NFSv4: an application that was born to be user space but after considering the tradeoffs well, people decided to put them in the kernel anyway. When you do that, it shouldn't be too surprising that people drag some user space things like opening files by name with it. In this case, though, there's an easy enough fix: something in user space opens /var/lib/nfs/v4recover and passes the file handle to the kernel in a server configuration step. This would be like what process accounting and disk quota do. That addresses the use of file names in the kernel; it's not to say there aren't other problems with the present approach. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 41+ messages in thread
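
The suggestion above (open the directory once in user space and hand over a handle instead of a name) can be illustrated with openat(2). This is a userspace sketch under stated assumptions, not the interface nfsd would use: the thread does not say how the descriptor would reach the kernel, and create_client_record() is a hypothetical name. It only shows that, once a directory descriptor exists, later operations need no path name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an empty record called `name` inside the
 * directory that `dirfd` refers to. No absolute path is involved; the
 * choice of directory was made once, by whoever opened `dirfd`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int create_client_record(int dirfd, const char *name)
{
    int fd;

    assert(name != NULL);

    /* Refuse names that could step outside the directory. */
    if (name[0] == '\0' || strchr(name, '/') != NULL) {
        errno = EINVAL;
        return -1;
    }

    /* O_EXCL: fail rather than reuse an existing record. */
    fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW, 0600);
    if (fd < 0)
        return -1;
    return close(fd);
}
```

A descriptor obtained with open(dir, O_RDONLY | O_DIRECTORY) keeps referring to the same directory even if that directory is renamed afterwards, which is one reason a descriptor is a more robust handle than a stored path string.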
* Re: NFS4 crack 2005-09-19 17:13 ` Bryan Henderson @ 2005-09-19 17:16 ` Randy.Dunlap 2005-09-19 21:57 ` Bryan Henderson 2005-09-19 18:02 ` Christoph Hellwig 1 sibling, 1 reply; 41+ messages in thread From: Randy.Dunlap @ 2005-09-19 17:16 UTC (permalink / raw) To: Bryan Henderson Cc: Christoph Hellwig, akpm, andros, J. Bruce Fields, linux-fsdevel, neilb On Mon, 19 Sep 2005, Bryan Henderson wrote: > >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL > > I think that's a great policy, but we can't be all that righteous about it > because we don't do it today. I have a system that has highly customized > file names, so I'm pretty familiar with all the world's hardcoded file > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and > /sbin/modprobe. I could give /sbin/init and /bin/sh a pass because agreed. > they're involved in bootstrapping, which always breaks a few rules. > /sbin/modprobe is part of an application that is in the same boat as > NFSv4: an application that was born to be user space but after > considering the tradeoffs well, people decided to put them in the kernel > anyway. When you do that, it shouldn't be too surprising that people drag > some user space things like opening files by name with it. modprobe executable filename comes from here: rddunlap@vortex:/proc/sys/kernel> cat modprobe /sbin/modprobe > In this case, though, there's an easy enough fix: something in user space > opens /var/lib/nfs/v4recover and passes the file handle to the kernel in a > server configuration step. This would be like what process accounting and > disk quota do. > > That addresses the use of file names in the kernel; it's not to say there > aren't other problems with the present approach. -- ~Randy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 17:16 ` Randy.Dunlap @ 2005-09-19 21:57 ` Bryan Henderson 2005-09-19 22:11 ` Randy.Dunlap 0 siblings, 1 reply; 41+ messages in thread From: Bryan Henderson @ 2005-09-19 21:57 UTC (permalink / raw) To: Randy.Dunlap Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel, neilb >>ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and >> /sbin/modprobe. >modprobe executable filename comes from here: >rddunlap@vortex:/proc/sys/kernel> cat modprobe >/sbin/modprobe Did you mean this as a contradiction? Because it isn't. /sbin/modprobe is hardcoded in the kernel as the default name of the module loader program. More importantly, even if you set the module loader program name externally, the kernel is still accessing that file by name, and that's less than desirable. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 21:57 ` Bryan Henderson @ 2005-09-19 22:11 ` Randy.Dunlap 2005-09-20 0:17 ` Bryan Henderson 0 siblings, 1 reply; 41+ messages in thread From: Randy.Dunlap @ 2005-09-19 22:11 UTC (permalink / raw) To: Bryan Henderson Cc: Randy.Dunlap, akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel, neilb On Mon, 19 Sep 2005, Bryan Henderson wrote: > >>ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and > >> /sbin/modprobe. > >modprobe executable filename comes from here: > >rddunlap@vortex:/proc/sys/kernel> cat modprobe > >/sbin/modprobe > > Did you mean this as a contradiction? Because it isn't. /sbin/modprobe > is hardcoded in the kernel as the default name of the module loader > program. More importantly, even if you set the module loader program name > externally, the kernel is still accessing that file by name, and that's > less than desirable. Yes, there's a hard-coded default value for the module loader. That doesn't sound bad to me. So if I choose to use /sbin/bhloader (i.e., I set /proc/sys/kernel/modprobe to "/sbin/bhloader"), what's the problem? How should the kernel access that file? And the kernel doesn't really access that file per se. It just calls call_usermodehelper() to start a task and modprobe_path is one of the parameters there. Thanks, -- ~Randy ^ permalink raw reply [flat|nested] 41+ messages in thread
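
The pattern being debated here, a built-in default path that a run-time setting can override, looks roughly like this when written as ordinary userspace C. It is an illustration only: resolve_helper_path() is a hypothetical name, and this is not how the kernel stores or reads modprobe_path internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the first line of `config_file` (without
 * its trailing newline) into `out`. If the file cannot be read or the
 * line is empty, copy `fallback` instead.
 * Returns 1 if the configured value was used, 0 if the fallback was
 * used, and -1 if `out` is too small for the chosen string.
 */
int resolve_helper_path(const char *config_file, const char *fallback,
                        char *out, size_t outlen)
{
    char line[4096];
    int configured = 0;
    const char *src;
    FILE *f;

    assert(fallback != NULL && out != NULL);

    f = config_file ? fopen(config_file, "r") : NULL;
    if (f != NULL) {
        if (fgets(line, sizeof line, f) != NULL) {
            line[strcspn(line, "\n")] = '\0';   /* strip the newline */
            configured = (line[0] != '\0');
        }
        fclose(f);
    }

    src = configured ? line : fallback;
    if (strlen(src) >= outlen)
        return -1;
    strcpy(out, src);
    return configured;
}
```

On a Linux system, passing "/proc/sys/kernel/modprobe" as config_file and "/sbin/modprobe" as fallback would read the same value that the `cat modprobe` example earlier in the thread prints.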
* Re: NFS4 crack 2005-09-19 22:11 ` Randy.Dunlap @ 2005-09-20 0:17 ` Bryan Henderson 0 siblings, 0 replies; 41+ messages in thread From: Bryan Henderson @ 2005-09-20 0:17 UTC (permalink / raw) To: Randy.Dunlap Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel, neilb, Randy.Dunlap >Yes, there's a hard-coded default value for the module loader. >That doesn't sound bad to me. > >So if I choose to use /sbin/bhloader (i.e., I set >/proc/sys/kernel/modprobe to "/sbin/bhloader"), >what's the problem? How should the kernel access that file? Remember that the main point of this subthread is that the situation complained of in the NFSv4 kernel code already exists, with the question of whether that is an OK situation considered separately. The NFSv4 code also has a hardcoded default file name (well, filesystem object name anyway) that can be overridden by the user, but the kernel code identifies the filesystem object by name when it's time to use it, in any case. Christoph points out that the practical ramifications are less with /sbin/modprobe because it's an executable file and tends to exist always, but at a more basic level, the two are analogous. But I do find the situation objectionable (in both cases), because I prefer layering. I prefer that the guts of the kernel know nothing about the file name space, which means "/sbin/modprobe" can't be special in any way, and the kernel can't request any service by file name. If it were up to me, the kernel would inform a user space process that a module needs to be loaded and it would be up to that process to decide from what file to get the loader program (maybe based on a config file in /etc). The kernel would never know the name of that file. I believe it used to be that way, so apparently someone else had different priorities. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 17:13 ` Bryan Henderson 2005-09-19 17:16 ` Randy.Dunlap @ 2005-09-19 18:02 ` Christoph Hellwig 2005-09-19 18:53 ` William A.(Andy) Adamson 2005-09-19 19:01 ` J. Bruce Fields 1 sibling, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 18:02 UTC (permalink / raw)
To: Bryan Henderson
Cc: Christoph Hellwig, akpm, andros, J. Bruce Fields, linux-fsdevel, neilb

On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote:
> >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
>
> I think that's a great policy, but we can't be all that righteous about it
> because we don't do it today. I have a system that has highly customized
> file names, so I'm pretty familiar with all the world's hardcoded file
> names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> /sbin/modprobe.

They are not nice, but quite a bit different, as we are trying to execute
them, which can't have bad side-effects in case they don't exist.

What nfsd does is expect a directory to be present on which it can
do various operations. That's much worse than trying to execute or
even read from a file. Besides that, all this directory handling really
belongs in userland, as pointed out _three times_ now.
* Re: NFS4 crack 2005-09-19 18:02 ` Christoph Hellwig @ 2005-09-19 18:53 ` William A.(Andy) Adamson 2005-09-19 18:59 ` Christoph Hellwig 2005-09-19 22:04 ` Bryan Henderson 2005-09-19 19:01 ` J. Bruce Fields 1 sibling, 2 replies; 41+ messages in thread From: William A.(Andy) Adamson @ 2005-09-19 18:53 UTC (permalink / raw) To: Christoph Hellwig Cc: Bryan Henderson, Christoph Hellwig, akpm, andros, J. Bruce Fields, linux-fsdevel, neilb > On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote: > > >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL > > > > I think that's a great policy, but we can't be all that righteous about it > > because we don't do it today. I have a system that has highly customized > > file names, so I'm pretty familiar with all the world's hardcoded file > > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and > > /sbin/modprobe. > > They are not nice, but quite a bit different, as we are trying to execute > them, which can't have bad side-effects in case they don't exist. > > What nfsd does is expecting a directory to be present on which it can > do various operations. That's much worse then trying to execute or > even read from a file. what we could do is not provide a default, and turn off reboot recovery (no grace period) if the recovery directory is not configured. > Besides that all this directory handling really > belongs into userland as pointed out _three times_ now. We were anticipating placing data into files in the recovery directory at each OPEN and each LOCK call in order to limit the scope of the NFSv4 grace period to the state that was actually in use prior to the reboot. We therefore went ahead with a kernel implementation for performance reasons. -->Andy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 18:53 ` William A.(Andy) Adamson @ 2005-09-19 18:59 ` Christoph Hellwig 2005-09-19 22:04 ` Bryan Henderson 1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 18:59 UTC (permalink / raw)
To: William A.(Andy) Adamson
Cc: Christoph Hellwig, Bryan Henderson, Christoph Hellwig, akpm, J. Bruce Fields, linux-fsdevel, neilb

On Mon, Sep 19, 2005 at 02:53:36PM -0400, William A.(Andy) Adamson wrote:
> > They are not nice, but quite a bit different, as we are trying to execute
> > them, which can't have bad side-effects in case they don't exist.
> >
> > What nfsd does is expecting a directory to be present on which it can
> > do various operations. That's much worse than trying to execute or
> > even read from a file.
>
> what we could do is not provide a default, and turn off reboot recovery (no
> grace period) if the recovery directory is not configured.
>
> > Besides that all this directory handling really
> > belongs into userland as pointed out _three times_ now.
>
> We were anticipating placing data into files in the recovery directory at each
> OPEN and each LOCK call in order to limit the scope of the NFSv4 grace period
> to the state that was actually in use prior to the reboot. We therefore went
> ahead with a kernel implementation for performance reasons.

Then pass in a file descriptor for each client. Doing all these
directory operations is not an option - if you need to do actual file
I/O to them that's less of a problem.

And please discuss such design issues here on -fsdevel.
* Re: NFS4 crack 2005-09-19 18:53 ` William A.(Andy) Adamson 2005-09-19 18:59 ` Christoph Hellwig @ 2005-09-19 22:04 ` Bryan Henderson 1 sibling, 0 replies; 41+ messages in thread From: Bryan Henderson @ 2005-09-19 22:04 UTC (permalink / raw) To: William A.(Andy) Adamson Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, Christoph Hellwig, linux-fsdevel, neilb >what we could do is not provide a default, and turn off reboot recovery (no >grace period) if the recovery directory is not configured. It sounds like you're still talking about configuring a file name into the kernel -- just doing it at run time instead of build time. While better, I'd really rather not see the kernel access files by names. Giving the kernel file descriptors would be better. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 18:02 ` Christoph Hellwig 2005-09-19 18:53 ` William A.(Andy) Adamson @ 2005-09-19 19:01 ` J. Bruce Fields 2005-09-19 19:05 ` Christoph Hellwig 1 sibling, 1 reply; 41+ messages in thread From: J. Bruce Fields @ 2005-09-19 19:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Bryan Henderson, Christoph Hellwig, akpm, andros, linux-fsdevel, neilb On Mon, Sep 19, 2005 at 07:02:40PM +0100, Christoph Hellwig wrote: > On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote: > > >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL > > > > I think that's a great policy, but we can't be all that righteous about it > > because we don't do it today. I have a system that has highly customized > > file names, so I'm pretty familiar with all the world's hardcoded file > > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and > > /sbin/modprobe. > > They are not nice, but quite a bit different, as we are trying to execute > them, which can't have bad side-effects in case they don't exist. What bad side-effects are you thinking of here? --b. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-19 19:01 ` J. Bruce Fields @ 2005-09-19 19:05 ` Christoph Hellwig 0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 19:05 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Christoph Hellwig, Bryan Henderson, Christoph Hellwig, akpm, andros, linux-fsdevel, neilb

On Mon, Sep 19, 2005 at 03:01:17PM -0400, J. Bruce Fields wrote:
> > > because we don't do it today. I have a system that has highly customized
> > > file names, so I'm pretty familiar with all the world's hardcoded file
> > > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> > > /sbin/modprobe.
> >
> > They are not nice, but quite a bit different, as we are trying to execute
> > them, which can't have bad side-effects in case they don't exist.
>
> What bad side-effects are you thinking of here?

Sorry, s/don't exist/& as expected/

Think of your directory being a symlink to something important: you'll
just mess with it, confuse nfsd, wipe parts out. All kinds of nasty
things can happen.
* Re: NFS4 crack 2005-09-19 10:35 ` Christoph Hellwig 2005-09-19 13:04 ` Anton Altaparmakov 2005-09-19 13:35 ` J. Bruce Fields @ 2005-09-19 20:31 ` J. Bruce Fields 2005-09-20 12:49 ` Greg KH 2 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-19 20:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel

On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote:
> > On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
> > > The fs handling in fs/nfs/nfs4recovery.c is rather broken in addition.
> >
> > For example?
>
>  - opens a directory O_RDWR which open_namei wouldn't even allow

OK, thanks, fixed locally.

>  - tries to build dentry list from vfs_readdir callback, leading to
>    deadlocks on filesystems that take the same lock from readdir
>    and lookup

So it appears that nfsd has long made the requirement that filesystems
not do this. Does this need to be documented somewhere?

>  - resets fsuid/fsgids without checks, synchronization or callouts
>    into subsystems that care (security, keys, ptrace)

I think the model here was nfsd_setuser(), which does essentially the
same thing. Is this an nfsd bug?

>  - looks up /var/lib/nfs/v4recovery without ensuring it's a directory

Oops, thanks.

> and probably a few more if one tried to look at it for more than five
> minutes. This is code that could be a third of the size if written
> in userspace and actually had a chance to be correct there, never mind
> the policy violations.

That's a couple of good bugs identified, thanks, but I'm not convinced that
this would be significantly simpler from userspace.

We'd need two pieces of user<->kernel interface:

    1. An upcall to userspace to tell it about new client state. We
       also need to be able to wait for userspace to commit something
       to disk, as the information has to survive a reboot.
    2. A way for userspace to dump recorded state to the kernel the
       next time nfsd starts up.

Number 1 could be done with something like hotplug, I guess. (It can be
told to wait for the userspace helper to exit, right?) Another file in
the nfsd filesystem might work for the second interface.

We also considered accomplishing number 1 by appending records to a log
file. Userspace could hand in a file descriptor to use for this
purpose. We'd still need the second interface.

--b.
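
The log-file variant mentioned above hinges on one property: a record only counts once it has reached stable storage, because it must survive a reboot. A minimal userspace sketch of that step follows. The function name log_client_record() and the one-record-per-line format are assumptions made for illustration; the thread does not define a record format, and the descriptor is assumed to have been opened with O_APPEND.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one newline-terminated record to an
 * already-open log descriptor, then force it to stable storage.
 * Returns 0 only after fsync() has succeeded, -1 otherwise.
 */
int log_client_record(int logfd, const char *client_id)
{
    char buf[256];
    size_t len, total, done = 0;

    assert(client_id != NULL);

    len = strlen(client_id);
    /* One record per line: reject empty, oversized or multi-line ids. */
    if (len == 0 || len >= sizeof(buf) - 1 || strchr(client_id, '\n')) {
        errno = EINVAL;
        return -1;
    }
    memcpy(buf, client_id, len);
    buf[len] = '\n';
    total = len + 1;

    /* write(2) may be short or interrupted, so loop until done. */
    while (done < total) {
        ssize_t n = write(logfd, buf + done, total - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)n;
    }

    /* The caller may acknowledge the client only after this returns 0. */
    return fsync(logfd);
}
```

The cost of the fsync() on every record is exactly the latency that "wait for userspace to commit something to disk" refers to, whichever side of the user/kernel boundary performs the write.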
* Re: NFS4 crack 2005-09-19 20:31 ` J. Bruce Fields @ 2005-09-20 12:49 ` Greg KH 2005-09-20 15:10 ` William A.(Andy) Adamson 0 siblings, 1 reply; 41+ messages in thread From: Greg KH @ 2005-09-20 12:49 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel On Mon, Sep 19, 2005 at 04:31:43PM -0400, J. Bruce Fields wrote: > We'd need two pieces of user<->kernel interface: > > 1. An upcall to userspace to tell it about new client state. We > also need to be able to wait for userspace to commit something > to disk, as the information has to survive a reboot. > 2. A way for userspace to dump recorded state to the kernel the > next time nfsd starts up. > > Number 1 could be done with something like hotplug, I guess. (It can be > told to wait for the userspace helper to exit, right?) Well, calling /sbin/hotplug itself can't be told to wait, especially as that value is being set to NULL by most distros these days, as they are using netlink instead. Good luck, greg k-h ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack 2005-09-20 12:49 ` Greg KH @ 2005-09-20 15:10 ` William A.(Andy) Adamson 0 siblings, 0 replies; 41+ messages in thread
From: William A.(Andy) Adamson @ 2005-09-20 15:10 UTC (permalink / raw)
To: Greg KH
Cc: J. Bruce Fields, Christoph Hellwig, akpm, neilb, andros, linux-fsdevel, andros

> On Mon, Sep 19, 2005 at 04:31:43PM -0400, J. Bruce Fields wrote:
> > We'd need two pieces of user<->kernel interface:
> >
> > 1. An upcall to userspace to tell it about new client state. We
> > also need to be able to wait for userspace to commit something
> > to disk, as the information has to survive a reboot.
> > 2. A way for userspace to dump recorded state to the kernel the
> > next time nfsd starts up.
> >
> > Number 1 could be done with something like hotplug, I guess. (It can be
> > told to wait for the userspace helper to exit, right?)
>
> Well, calling /sbin/hotplug itself can't be told to wait, especially as
> that value is being set to NULL by most distros these days, as they are
> using netlink instead.
>

call_usermodehelper_keys() with the wait status is what we are thinking of
using for #1. note that the keyring code which uses call_usermodehelper_keys
also hard codes an executable name.

security/keys/request_key.c:

        /* set up the argument list */
        i = 0;
        argv[i++] = "/sbin/request-key";
        argv[i++] = (char *) op;

-->Andy
* Re: NFS4 crack
  2005-09-18 10:21 NFS4 crack Christoph Hellwig
  2005-09-18 14:36 ` J. Bruce Fields
@ 2005-09-20 18:37 ` Neil Brown
  2005-09-21 7:44 ` Andrew Morton
  ` (3 more replies)
  1 sibling, 4 replies; 41+ messages in thread
From: Neil Brown @ 2005-09-20 18:37 UTC
To: Christoph Hellwig; +Cc: akpm, andros, bfields, linux-fsdevel, Olaf Kirch

On Sunday September 18, hch@lst.de wrote:
> I've recently turned on NFS4 server support accidentally, just to get
> error messages like:
>
> "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
>
> To my horror I found out that this comes from kernel code, which messes
> with a hardcoded directory, completely ignoring any namespace or other
> uses issues. The fs handling in fs/nfs/nfs4recovery.c is rather broken
> in addition.
>
> All this comes from "[PATCH] knfsd: nfsd4: initialize recovery directory",
> commit ID 190e4fbf96037e5e526ba3210f2bcc2a3b6fe964.

I confess that I am having trouble finding a convincing basis for your
position, which is why I allowed the patch through in the first place
(despite not particularly liking it).

My problem is: where do you draw the line?

It should be noted first that nfsd is unlike most (all?) other kernel
code. It is an application that is running in-kernel. It is a consumer
of kernel services, and provides no (significant) services to
user-space, or to other parts of the kernel.

Now, this in-kernel-application needs to store stable
application-specific data somewhere. May it:

 1/ open a directory and create files in it and write to them
 2/ open a directory and create files provided that the name of the
    directory is given by userspace
 3/ create files in a directory that was created by userspace and
    given to the knfsd application as a filedescriptor
 4/ write data to files which were created and opened by user-space
    based on filenames provided by knfsd (hostnames or equivalents in
    this case).
 5/ pass the data to userspace and let it worry completely.
 6/ sorry, you cannot have application-specific state.

I'm sure you will see a progression here. I ask again: "where do you
draw the line?"

You seem to rule out 1, and probably 2, and possibly 3 based on other
comments in the thread.

I cannot see a rational place to draw the line other than before-1 or
after-4. i.e. if you allow 4, you may as well allow 1 too.

If you can give a clear argument for some particular place to draw the
line, I'd love to hear it, together with your justification.

While considering it, you might also like to consider:
 - is it ok for knfsd to bind to port 2049 ?
 - is it ok if userspace tells it the number '2049' ?
 - does user-space have to create/bind the socket and pass it to
   knfsd?
 - does user-space have to receive the packets and pass them to
   knfsd? (ok, that one is really silly).
and "why?"

The reality is that NFS service is an application. Currently parts of
it are in-kernel (nfsd, lockd) and parts are in user-space (portmap,
statd(*), mountd).

There are two positions on what-goes-where that make sense to me:

 1- pragmatism: put code where it works best. I believe that the
    current code fits pragmatism quite well (modulo bugs).
 2- "rightness": If you want to argue from a what-belongs-where
    perspective, you have to say that knfsd doesn't belong in the
    kernel at all. The kernel should just supply the core services
    (e.g. file-handle <-> fd mapping) and let userspace do the rest.

Were I starting to write knfsd today, I would pick 2. Given where we
actually are today, I pick 1.

> we really need someone sane review NFS patches I think.

yes please.. pretty please :-)

>
> (not cc'ed to the nfs list because of its stupid subscribers only
> policy)

Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
are like that, and nfs-devel@linux.kernel.org died long ago. :-(

NeilBrown

(*) There are patches in existence which move statd implementation
into the kernel. The final conclusion here may well affect those
patches, so I hope Olaf has been listening in....
* Re: NFS4 crack
  2005-09-20 18:37 ` Neil Brown
@ 2005-09-21 7:44 ` Andrew Morton
  2005-09-22 20:58 ` William A.(Andy) Adamson
  2005-09-21 13:41 ` Trond Myklebust
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2005-09-21 7:44 UTC
To: Neil Brown; +Cc: hch, andros, bfields, linux-fsdevel, okir

Neil Brown <neilb@suse.de> wrote:
>
> Now, this in-kernel-application needs to store stable
> application-specific data somewhere. May it:
>
>  1/ open a directory and create files in it and write to them
>  2/ open a directory and create files provided that the name of the
>     directory is given by userspace
>  3/ create files in a directory that was created by userspace and
>     given to the knfsd application as a filedescriptor
>  4/ write data to files which were created and opened by user-space
>     based on filenames provided by knfsd (hostnames or equivalents in
>     this case).
>  5/ pass the data to userspace and let it worry completely.
>  6/ sorry, you cannot have application-specific state.
>

5/ sounds good. There are numerous options, newly including connector and
configfs.
* Re: NFS4 crack
  2005-09-21 7:44 ` Andrew Morton
@ 2005-09-22 20:58 ` William A.(Andy) Adamson
  0 siblings, 0 replies; 41+ messages in thread
From: William A.(Andy) Adamson @ 2005-09-22 20:58 UTC
To: Andrew Morton
Cc: Neil Brown, hch, andros, bfields, linux-fsdevel, okir, andros

> Neil Brown <neilb@suse.de> wrote:
> >
> > Now, this in-kernel-application needs to store stable
> > application-specific data somewhere. May it:
> >
> >  1/ open a directory and create files in it and write to them
> >  2/ open a directory and create files provided that the name of the
> >     directory is given by userspace
> >  3/ create files in a directory that was created by userspace and
> >     given to the knfsd application as a filedescriptor
> >  4/ write data to files which were created and opened by user-space
> >     based on filenames provided by knfsd (hostnames or equivalents in
> >     this case).
> >  5/ pass the data to userspace and let it worry completely.
> >  6/ sorry, you cannot have application-specific state.
> >
>
> 5/ sounds good. There are numerous options, newly including connector and
> configfs.

alright. i'll look into a user space solution.

-->Andy
* Re: NFS4 crack
  2005-09-20 18:37 ` Neil Brown
  2005-09-21 7:44 ` Andrew Morton
@ 2005-09-21 13:41 ` Trond Myklebust
  2005-09-21 14:40 ` J. Bruce Fields
  2005-09-22 16:28 ` Bryan Henderson
  3 siblings, 0 replies; 41+ messages in thread
From: Trond Myklebust @ 2005-09-21 13:41 UTC
To: Neil Brown
Cc: Christoph Hellwig, akpm, andros, bfields, linux-fsdevel, Olaf Kirch

On Wed, 21.09.2005 at 04:37 (+1000), Neil Brown wrote:
> Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
> are like that, and nfs-devel@linux.kernel.org died long ago. :-(

I can set up an unmoderated NFS list on linux-nfs.org if there is a
demand for it. I could also open up nfsv4@linux-nfs.org if that is
desirable.

However my preference would be to see the admins for
nfs@lists.sourceforge.net (whoever the hell is in that select list these
days) open that list up. There should be no need to keep duplicating all
these mailing lists, and nfs@lists is currently supposed to be the
generic NFS list.

Cheers,
  Trond
* Re: NFS4 crack
  2005-09-20 18:37 ` Neil Brown
  2005-09-21 7:44 ` Andrew Morton
  2005-09-21 13:41 ` Trond Myklebust
@ 2005-09-21 14:40 ` J. Bruce Fields
  2005-09-22 16:28 ` Bryan Henderson
  3 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-21 14:40 UTC
To: Neil Brown; +Cc: Christoph Hellwig, akpm, andros, linux-fsdevel, Olaf Kirch

On Wed, Sep 21, 2005 at 04:37:36AM +1000, Neil Brown wrote:
> On Sunday September 18, hch@lst.de wrote:
> > (not cc'ed to the nfs list because of its stupid subscribers only
> > policy)
>
> Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
> are like that, and nfs-devel@linux.kernel.org died long ago. :-(

The nfsv4@linux-nfs.org policy is to defer non-subscriber email for
moderation. There are a couple moderators, and we should usually be
able to moderate (and whitelist) anyone within a few hours. But we
could open it up more.

It'd also be nice to open up the sourceforge list some more--I think it
has the same sort of policy but the delays occasionally seem to be
measured in weeks.

--b.
* Re: NFS4 crack
  2005-09-20 18:37 ` Neil Brown
  ` (2 preceding siblings ...)
  2005-09-21 14:40 ` J. Bruce Fields
@ 2005-09-22 16:28 ` Bryan Henderson
  2005-09-22 16:52 ` Trond Myklebust
  3 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 16:28 UTC
To: Neil Brown
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

>2- "rightness": If you want to argue from a what-belongs-where
>   perspective, you have to say that knfsd doesn't belong in the
>   kernel at all. The kernel should just supply the core services
>   (e.g. file-handle <-> fd mapping) and let userspace do the rest.

This is the real reason that it is so hard to draw that line, and why
fairly natural code in knfsd makes kernel programmers recoil in horror.

Maybe you could remind everyone why knfsd is in the kernel. If it's just
speed, what if anything would have to change in the structure of a system
to make it work as fast in user space?

--
Bryan Henderson, IBM Almaden Research Center
San Jose CA, Filesystems
* Re: NFS4 crack
  2005-09-22 16:28 ` Bryan Henderson
@ 2005-09-22 16:52 ` Trond Myklebust
  2005-09-22 17:38 ` Peter Staubach
  0 siblings, 1 reply; 41+ messages in thread
From: Trond Myklebust @ 2005-09-22 16:52 UTC
To: Bryan Henderson
Cc: Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

On Thu, 22.09.2005 at 09:28 (-0700), Bryan Henderson wrote:

> Maybe you could remind everyone why knfsd is in the kernel. If it's just
> speed, what if anything would have to change in the structure of a system
> to make it work as fast in user space?

The main reason for keeping (part of) the NFS server in the kernel is
not speed, but coping with races.

In particular note that all NFS operations on files take an opaque
filehandle argument rather than a path. For instance, the operation
CREATE takes a filehandle argument in order to determine the path of the
directory in which to create the file, then a string argument to
determine the filename.
The set of filesystem-supplied helper functions that convert a
filehandle into a dentry means that knfsd can do this safely without
danger of racing with rename() calls, unlink(),...
Trying to do the same thing in userland would have to involve first
converting the filehandle into a pathname, and then calling a POSIX
function using that pathname which is obviously very race prone.

Cheers,
  Trond
* Re: NFS4 crack
  2005-09-22 16:52 ` Trond Myklebust
@ 2005-09-22 17:38 ` Peter Staubach
  2005-09-22 17:52 ` Trond Myklebust
  2005-09-22 21:19 ` Bryan Henderson
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-22 17:38 UTC
To: Trond Myklebust
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

Trond Myklebust wrote:

>On Thu, 22.09.2005 at 09:28 (-0700), Bryan Henderson wrote:
>
>>Maybe you could remind everyone why knfsd is in the kernel. If it's just
>>speed, what if anything would have to change in the structure of a system
>>to make it work as fast in user space?
>
>The main reason for keeping (part of) the NFS server in the kernel is
>not speed, but coping with races.
>
>In particular note that all NFS operations on files take an opaque
>filehandle argument rather than a path. For instance, the operation
>CREATE takes a filehandle argument in order to determine the path of the
>directory in which to create the file, then a string argument to
>determine the filename.
>The set of filesystem-supplied helper functions that convert a
>filehandle into a dentry means that knfsd can do this safely without
>danger of racing with rename() calls, unlink(),...
>Trying to do the same thing in userland would have to involve first
>converting the filehandle into a pathname, and then calling a POSIX
>function using that pathname which is obviously very race prone.
>

It seems to me that a "system call" could be implemented which would allow
a file to be "opened" via the file handle.

But then, we would be back to the speed argument. Switching in and out of
the kernel requires time and data copies, both of which are not good and
would kill any possibilities of making the Linux NFS server competitive.

ps
* Re: NFS4 crack
  2005-09-22 17:38 ` Peter Staubach
@ 2005-09-22 17:52 ` Trond Myklebust
  2005-09-22 18:07 ` Peter Staubach
  2005-09-22 21:19 ` Bryan Henderson
  1 sibling, 1 reply; 41+ messages in thread
From: Trond Myklebust @ 2005-09-22 17:52 UTC
To: Peter Staubach
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:

> It seems to me that a "system call" could be implemented which would allow
> a file to be "opened" via the file handle.

Sure, but open alone isn't sufficient. A lot (most?) of the operations
involving filehandles are acting on directories.

Imagine if someone renames a directory on the server while the NFS
server is in the middle of an unlink() operation, for instance.

Cheers,
  Trond
* Re: NFS4 crack
  2005-09-22 17:52 ` Trond Myklebust
@ 2005-09-22 18:07 ` Peter Staubach
  2005-09-22 21:08 ` Bryan Henderson
  ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-22 18:07 UTC
To: Trond Myklebust
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

Trond Myklebust wrote:

>On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
>
>>It seems to me that a "system call" could be implemented which would allow
>>a file to be "opened" via the file handle.
>
>Sure, but open alone isn't sufficient. A lot (most?) of the operations
>involving filehandles are acting on directories.
>
>Imagine if someone renames a directory on the server while the NFS
>server is in the middle of an unlink() operation, for instance.
>

Yup, although you could resolve that by introducing a whole set of
operations which work off of file descriptors, instead of pathnames.
Then, inside of the kernel, to do the real operation, the file
descriptor would get turned back into the inode, but without the
pathname lookup portion. Things like funlink(fd, name), fmkdir(fd, name),
frmdir(fd, name), etc. Other operating systems have implemented at
least a subset of these sorts of calls and it gets ugly quickly. The
NFS server also has to do its own special checking and sometimes this
checking conflicts with the checking done in the normal "from user mode"
path.

---

Without a great deal of work and many new interfaces, there is no way to
get something like the NFS server to run correctly outside of the kernel
address space. There are correctness issues such as Trond has pointed out
and there are performance issues as well.

Is there an inherent problem with the NFS server being implemented as an
alternate VFS layer in the kernel, with its own requirements? Or is this
an academic problem?

Unless we are willing to consider moving to a micro-kernel approach, a la
Mach, then we are going to need to consider the requirements of kernel
based applications in addition to user level applications.

Thanx...

ps
* Re: NFS4 crack
  2005-09-22 18:07 ` Peter Staubach
@ 2005-09-22 21:08 ` Bryan Henderson
  2005-09-23 12:17 ` Peter Staubach
  2005-09-22 21:48 ` NFS4 crack Nicholas Miell
  2005-09-22 22:50 ` Greg Banks
  2 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 21:08 UTC
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

>Yup, although you could resolve that by introducing a whole set of
>operations which work off of file descriptors, instead of pathnames.

To do the whole job, what you need is a set of system calls that work off
NFS file handles instead of path names, and you may even need a different
kind of open state, ergo file descriptor, and these system calls would
require special privilege.

And that's not so crazy -- Linux/Unix is long overdue for a more advanced
system call file interface than POSIX. NFS needs it; Windows
compatibility (e.g. Samba) needs it; backup, HSM, and storage management
need it. It just might not be practical in the near term.

It would be good to understand whether the NFS server is in the kernel for
basic structural reasons or just because we're too lazy to invent this new
system call interface, because that sheds light on how a normally user
space problem like storing persistent application data for NFSv4 should be
approached. Do we need a new kernel paradigm that admits file and
filename use within the kernel, or do we hold our nose and say, "what's
one more hack on top of an existing one?"

--
Bryan Henderson, IBM Almaden Research Center
San Jose CA, Filesystems
* Re: NFS4 crack
  2005-09-22 21:08 ` Bryan Henderson
@ 2005-09-23 12:17 ` Peter Staubach
  2005-09-23 20:50 ` Bryan Henderson
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Staubach @ 2005-09-23 12:17 UTC
To: Bryan Henderson
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

Bryan Henderson wrote:

>
>It would be good to understand whether the NFS server is in the kernel for
>basic structural reasons or just because we're too lazy to invent this new
>system call interface, because that sheds light on how a normally user
>space problem like storing persistent application data for NFSv4 should be
>approached. Do we need a new kernel paradigm that admits file and
>filename use within the kernel, or do we hold our nose and say, "what's
>one more hack on top of an existing one?"
>

The NFS server is in the kernel for basic structural reasons, but also for
performance reasons. I would be happy to hear and/or read a proposal on
how to get packets containing requests and/or responses in and out of the
kernel without copying them. Inside of the kernel, both can be handled
with no copies.

It isn't that we are too lazy, by the way. This issue gets looked into
every so often. The set of system calls can be determined pretty quickly
and implementing them, while tricky in spots, can be done. However, the
ugliness of the implementation soon starts to overwhelm the cleanliness of
the design.

--

I would even be happy with seeing a user mode local disk based file system
which performed as well as a kernel mode file system. That seems easier
to me because then there wouldn't be any of those sticky networking
issues to worry about. When we get this, then we can consider the value
of moving something like NFS too.

Thanx...

ps
* Re: NFS4 crack
  2005-09-23 12:17 ` Peter Staubach
@ 2005-09-23 20:50 ` Bryan Henderson
  2005-09-23 21:02 ` NFS4 crack\ Al Viro
  0 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-23 20:50 UTC
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

>The NFS server is in the kernel for basic structural reasons, but also for
>performance reasons.

The two are orthogonal. The faster kernel performance could be either
because of the basic structure of the system (to go that fast in user
space would be impossible or require ugly interfaces) or just convenience
(to go that fast in user space, someone would have to add some
interfaces).

>I would be happy to hear and/or read a proposal on
>how to get packets containing requests and/or responses in and out of the
>kernel without copying them. Inside of the kernel, both can be handled
>with no copies.

Proposals are beyond the scope of this conversation, since we're not
trying to design (or even argue for) user space nfsd but rather to
understand the dilemma of kernel code needing to do something (access
files by name) that we've always considered a non-kernel activity.

But if your point is that a decent proposal doesn't exist because
zero-copy network communication fundamentally has to be in the kernel,
then what about zero copy disk file access? (direct I/O, raw device,
mmap). The basic facility seems to be there. And if you really can't do
network communication as fast in user space as in the kernel, should we
expect other network applications with a high speed requirement to go in
the kernel too?

It seems to me that the VFS interface is a lot better reason for nfsd to
be special and be an in-kernel application. But so far, I haven't seen
any argument that whatever nfsd needs out of VFS couldn't cleanly be added
to a system call interface. You say people have actually looked into it
and found that it has to be ugly; I just don't yet see why myself.

--
Bryan Henderson, IBM Almaden Research Center
San Jose CA, Filesystems
* Re: NFS4 crack\
  2005-09-23 20:50 ` Bryan Henderson
@ 2005-09-23 21:02 ` Al Viro
  2005-09-26 16:29 ` Bryan Henderson
  0 siblings, 1 reply; 41+ messages in thread
From: Al Viro @ 2005-09-23 21:02 UTC
To: Bryan Henderson
Cc: Peter Staubach, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

On Fri, Sep 23, 2005 at 01:50:26PM -0700, Bryan Henderson wrote:
> It seems to me that the VFS interface is a lot better reason for nfsd to
> be special and be an in-kernel application. But so far, I haven't seen
> any argument that whatever nfsd needs out of VFS couldn't cleanly be added
> to a system call interface. You say people have actually looked into it
> and found that it has to be ugly; I just don't yet see why myself.

For one thing, you do *not* keep locks on directories across the syscall
boundary. Is that enough for you?
* Re: NFS4 crack\
  2005-09-23 21:02 ` NFS4 crack\ Al Viro
@ 2005-09-26 16:29 ` Bryan Henderson
  2005-09-26 17:13 ` Peter Staubach
  0 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-26 16:29 UTC
To: Al Viro
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Peter Staubach, Trond Myklebust

>On Fri, Sep 23, 2005 at 01:50:26PM -0700, Bryan Henderson wrote:
>> It seems to me that the VFS interface is a lot better reason for nfsd to
>> be special and be an in-kernel application. But so far, I haven't seen
>> any argument that whatever nfsd needs out of VFS couldn't cleanly be added
>> to a system call interface. You say people have actually looked into it
>> and found that it has to be ugly; I just don't yet see why myself.
>
>For one thing, you do *not* keep locks on directories across the syscall
>boundary. Is that enough for you?

Well, I wouldn't want to. I can't think of anything that an NFS server
does to a directory that couldn't be done cleanly with a single system
call, much the way the POSIX system calls do.

--
Bryan Henderson, IBM Almaden Research Center
San Jose CA, Filesystems
* Re: NFS4 crack\
  2005-09-26 16:29 ` Bryan Henderson
@ 2005-09-26 17:13 ` Peter Staubach
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-26 17:13 UTC
To: Bryan Henderson
Cc: Al Viro, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

Bryan Henderson wrote:

>
>Well, I wouldn't want to. I can't think of anything that an NFS server
>does to a directory that couldn't be done cleanly with a single system
>call, much the way the POSIX system calls do.
>

I might object to the characterization of "cleanly". We would need
system calls which matched the specific semantics of NFS operations.
For example, we would need system calls which understood pre-operation
and post-operation attributes. We would need system calls which
understood 32 bit limits so that we could correctly implement NFS
version 2. NFS version 3 and NFS version 2 are probably close enough
that we could use a common set of system calls with appropriate flags,
but it does not seem likely that these system calls would suffice for
NFS version 4. We could end up with a whole lot of system calls, even if
they all went through a common entry point into the kernel.

Thanx...

ps
* Re: NFS4 crack
  2005-09-22 18:07 ` Peter Staubach
  2005-09-22 21:08 ` Bryan Henderson
@ 2005-09-22 21:48 ` Nicholas Miell
  2005-09-22 22:50 ` Greg Banks
  2 siblings, 0 replies; 41+ messages in thread
From: Nicholas Miell @ 2005-09-22 21:48 UTC
To: Peter Staubach
Cc: Trond Myklebust, Bryan Henderson, Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

On Thu, 2005-09-22 at 14:07 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>
> >On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
> >
> >>It seems to me that a "system call" could be implemented which would allow
> >>a file to be "opened" via the file handle.
> >
> >Sure, but open alone isn't sufficient. A lot (most?) of the operations
> >involving filehandles are acting on directories.
> >
> >Imagine if someone renames a directory on the server while the NFS
> >server is in the middle of an unlink() operation, for instance.
> >
>
> Yup, although you could resolve that by introducing a whole set of
> operations which work off of file descriptors, instead of pathnames.
> Then, inside of the kernel, to do the real operation, the file
> descriptor would get turned back into the inode, but without the
> pathname lookup portion. Things like funlink(fd, name), fmkdir(fd, name),
> frmdir(fd, name), etc. Other operating systems have implemented at
> least a subset of these sorts of calls and it gets ugly quickly.

Solaris 10 calls them fchownat(2), fstatat(2), futimesat(2), openat(2),
renameat(2), and unlinkat(2).

They mostly exist to support their extended attributes implementation
(hence the "at" postfix, and not to be confused with Linux's xattrs),
but they work for general filesystem usage.

Besides being an interface to extended attributes and maybe making a
userspace NFSd feasible, they probably also improve filename lookup
performance on sufficiently deep directory hierarchies (think of httpd
opening /var/www/vservers/www.blah.com/html/ and then resolving
everything for that vserver relative to the cached fd).

--
Nicholas Miell <nmiell@comcast.net>
* Re: NFS4 crack
  2005-09-22 18:07 ` Peter Staubach
  2005-09-22 21:08 ` Bryan Henderson
  2005-09-22 21:48 ` NFS4 crack Nicholas Miell
@ 2005-09-22 22:50 ` Greg Banks
  2 siblings, 0 replies; 41+ messages in thread
From: Greg Banks @ 2005-09-22 22:50 UTC
To: Peter Staubach
Cc: Trond Myklebust, Bryan Henderson, Neil Brown, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch

On Thu, Sep 22, 2005 at 02:07:36PM -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>
> >On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
> >
> >Sure, but open alone isn't sufficient. A lot (most?) of the operations
> >involving filehandles are acting on directories.
> >
> >Imagine if someone renames a directory on the server while the NFS
> >server is in the middle of an unlink() operation, for instance.
>
> Yup, although you could resolve that by introducing a whole set of
> operations which work off of file descriptors, instead of pathnames.

To see why this is a bad idea, google for the unforeseen security
implications of Solaris' fchroot() syscall. Adding this kind of syscall
is *not* cost-free, you just won't know the cost until it's too late to
fix.

> [...] there are performance issues as well.

Performance sells boxes, selling boxes pays my bills, that's enough
reason for me.

The ability to do zero-copy efficiently and to (eventually) support RDMA
into the page cache is enough reason for a kernel nfsd. Sendfile? don't
make me laugh.

Also, a kernel nfsd can see network packet boundaries and other
information not visible through any existing network API, and it does so
in nonblocking fashion, which enables it to bounds check RPC calls better
than any userspace RPC implementation can. This is one reason why (e.g.)
TCP XDR fragment header DoS attacks are much harder against a kernel
based server than a userspace server. Another reason is that the kernel
nfsd refuses to accept multiple-fragment RPC calls, which is impossible
if you use the libc RPC server library.

Userspace nfsd: just say no.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
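For context on the fragment header point: ONC RPC over TCP prefixes each fragment with a 4-byte big-endian record mark whose top bit flags the last fragment and whose low 31 bits give the fragment length. The sketch below shows the kind of check being described (accept only single-fragment records under a size cap). It is not knfsd's code; the function name, the enum and the cap parameter are hypothetical.

```c
#include <stdint.h>

#define RPC_LAST_FRAGMENT 0x80000000u   /* top bit: final fragment */
#define RPC_FRAGMENT_LEN  0x7fffffffu   /* low 31 bits: byte count  */

enum rm_verdict {
    RM_OK = 0,              /* single fragment, acceptable length */
    RM_MULTI_FRAGMENT = -1, /* more fragments would follow        */
    RM_TOO_LARGE = -2       /* claimed length exceeds the cap     */
};

/*
 * Decode one record mark and apply a "single fragment, bounded size"
 * policy. On RM_OK the fragment length is stored in *len_out.
 */
enum rm_verdict check_record_mark(const unsigned char hdr[4],
                                  uint32_t max_len, uint32_t *len_out)
{
    uint32_t mark = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
                    ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    uint32_t len = mark & RPC_FRAGMENT_LEN;

    if (!(mark & RPC_LAST_FRAGMENT))
        return RM_MULTI_FRAGMENT;
    if (len > max_len)
        return RM_TOO_LARGE;

    *len_out = len;
    return RM_OK;
}
```

Rejecting before any buffer is allocated is what limits the damage a hostile length field can do.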
* Re: NFS4 crack
  2005-09-22 17:38 ` Peter Staubach
  2005-09-22 17:52 ` Trond Myklebust
@ 2005-09-22 21:19 ` Bryan Henderson
  1 sibling, 0 replies; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 21:19 UTC
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust

>Switching in and out of the kernel requires time and data copies,

Does it? We've successfully eliminated copying with things like mmap,
direct I/O, and sendfile. And while the common wisdom says switching in
and out of kernel mode takes an eon, is that actually true on modern
systems? Switching into the kernel is fundamentally a trivial operation:
set a flag that says you're in privileged mode and load the instruction
address register to point to a trusted instruction.

In the past, I've seen systems that also switch address space when that
happens, and have to purge TLB and/or processor cache on an address space
switch. That's a significant slowdown. But do modern Linux systems
suffer that way?

--
Bryan Henderson, IBM Almaden Research Center
San Jose CA, Filesystems
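As a concrete instance of the copy-avoiding interfaces mentioned here, the sketch below uses Linux sendfile(2) to move a file's contents between two descriptors without passing the data through a userspace buffer. It is Linux-specific and illustrative only; it says nothing about whether this is fast enough for an NFS server. The function name send_whole_file is hypothetical.

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Transfer the entire contents of in_fd to out_fd with sendfile(2).
 * The read position is tracked in a local offset, so in_fd's own file
 * position is left untouched. Returns the number of bytes transferred,
 * or -1 on error.
 */
long long send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) != 0)
        return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;          /* file shrank while we were copying */
    }
    return (long long)offset;
}
```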
end of thread, other threads:[~2005-09-26 17:14 UTC]

Thread overview: 41+ messages
2005-09-18 10:21 NFS4 crack Christoph Hellwig
2005-09-18 14:36 ` J. Bruce Fields
2005-09-19 10:35 ` Christoph Hellwig
2005-09-19 13:04 ` Anton Altaparmakov
2005-09-19 13:35 ` J. Bruce Fields
2005-09-19 13:39 ` Christoph Hellwig
2005-09-19 14:07 ` J. Bruce Fields
2005-09-19 14:11 ` Christoph Hellwig
2005-09-19 17:13 ` Bryan Henderson
2005-09-19 17:16 ` Randy.Dunlap
2005-09-19 21:57 ` Bryan Henderson
2005-09-19 22:11 ` Randy.Dunlap
2005-09-20 0:17 ` Bryan Henderson
2005-09-19 18:02 ` Christoph Hellwig
2005-09-19 18:53 ` William A.(Andy) Adamson
2005-09-19 18:59 ` Christoph Hellwig
2005-09-19 22:04 ` Bryan Henderson
2005-09-19 19:01 ` J. Bruce Fields
2005-09-19 19:05 ` Christoph Hellwig
2005-09-19 20:31 ` J. Bruce Fields
2005-09-20 12:49 ` Greg KH
2005-09-20 15:10 ` William A.(Andy) Adamson
2005-09-20 18:37 ` Neil Brown
2005-09-21 7:44 ` Andrew Morton
2005-09-22 20:58 ` William A.(Andy) Adamson
2005-09-21 13:41 ` Trond Myklebust
2005-09-21 14:40 ` J. Bruce Fields
2005-09-22 16:28 ` Bryan Henderson
2005-09-22 16:52 ` Trond Myklebust
2005-09-22 17:38 ` Peter Staubach
2005-09-22 17:52 ` Trond Myklebust
2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:08 ` Bryan Henderson
2005-09-23 12:17 ` Peter Staubach
2005-09-23 20:50 ` Bryan Henderson
2005-09-23 21:02 ` NFS4 crack\ Al Viro
2005-09-26 16:29 ` Bryan Henderson
2005-09-26 17:13 ` Peter Staubach
2005-09-22 21:48 ` NFS4 crack Nicholas Miell
2005-09-22 22:50 ` Greg Banks
2005-09-22 21:19 ` Bryan Henderson