* NFS4 crack
@ 2005-09-18 10:21 Christoph Hellwig
2005-09-18 14:36 ` J. Bruce Fields
2005-09-20 18:37 ` Neil Brown
0 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-18 10:21 UTC (permalink / raw)
To: akpm, neilb, andros, bfields; +Cc: linux-fsdevel
I've recently turned on NFS4 server support accidentally, just to get
error messages like:
"NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
To my horror I found out that this comes from kernel code, which messes
with a hardcoded directory, completely ignoring any namespace or other
uses issues. The fs handling in fs/nfsd/nfs4recovery.c is rather broken
in addition.
All this comes from "[PATCH] knfsd: nfsd4: initialize recovery directory",
commit ID 190e4fbf96037e5e526ba3210f2bcc2a3b6fe964.
Andrew, could you please back this out again, and NFS folks, please
don't do stuff like that and hide your crackpipe somewhere. And please
we really need someone sane to review NFS patches, I think.
(not cc'ed to the nfs list because of its stupid subscribers-only
policy)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-18 10:21 NFS4 crack Christoph Hellwig
@ 2005-09-18 14:36 ` J. Bruce Fields
2005-09-19 10:35 ` Christoph Hellwig
2005-09-20 18:37 ` Neil Brown
1 sibling, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-18 14:36 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel
On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
> I've recently turned on NFS4 server support accidentally, just to get
> error messages like:
>
> "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
>
> To my horror I found out that this comes from kernel code, which messes
> with a hardcoded directory, completely ignoring any namespace or other
> uses issues.
As long as all nfsd threads are in the same namespace, I don't see any
namespace issues. What am I missing?
> The fs handling in fs/nfsd/nfs4recovery.c is rather broken in addition.
For example?
--b.
* Re: NFS4 crack
2005-09-18 14:36 ` J. Bruce Fields
@ 2005-09-19 10:35 ` Christoph Hellwig
2005-09-19 13:04 ` Anton Altaparmakov
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 10:35 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel
On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote:
> On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
> > I've recently turned on NFS4 server support accidentally, just to get
> > error messages like:
> >
> > "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
> >
> > To my horror I found out that this comes from kernel code, which messes
> > with a hardcoded directory, completely ignoring any namespace or other
> > uses issues.
>
> As long as all nfsd threads are in the same namespace, I don't see any
> namespace issues. What am I missing?
Namespaces issues above was meant as kernel can't assume namespace at
all, not even thinking about multiple namespaces which makes it even
more wrong. Who says I allow the kernel to mess with
/var/lib/nfs/v4recover? Who tells any userspace process is even in the
same namespace as the nfs threads to create the directories?
Kernel assuming any namespace is wrong and we don't do it anywhere.
>
> > The fs handling in fs/nfsd/nfs4recovery.c is rather broken in addition.
>
> For example?
- opens a directory O_RDWR which open_namei wouldn't even allow
- tries to build dentry list from vfs_readdir callback, leading to
deadlocks on filesystems that take the same lock from readdir
and lookup
- resets fsuid/fsgids without checks, synchronization or callouts
into subsystems that care (security, keys, ptrace)
- looks up /var/lib/nfs/v4recovery without ensuring it's a directory
and probably a few more if one tried to look at it for more than five
minutes. This is code that could be a third of the size if written
in userspace and actually had a chance to be correct there, never mind
the policy violations.
Please remove the code and never ever try to sneak in something like
that again. Thanks.
* Re: NFS4 crack
2005-09-19 10:35 ` Christoph Hellwig
@ 2005-09-19 13:04 ` Anton Altaparmakov
2005-09-19 13:35 ` J. Bruce Fields
2005-09-19 20:31 ` J. Bruce Fields
2 siblings, 0 replies; 41+ messages in thread
From: Anton Altaparmakov @ 2005-09-19 13:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: J. Bruce Fields, akpm, neilb, andros, linux-fsdevel
On Mon, 2005-09-19 at 12:35 +0200, Christoph Hellwig wrote:
> On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote:
> > On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
[snip]
> > > The fs handling in fs/nfsd/nfs4recovery.c is rather broken in addition.
> >
> > For example?
[snip]
> - tries to build dentry list from vfs_readdir callback, leading to
> deadlocks on filesystems that take the same lock from readdir
> and lookup
NFSv3 has always done this and yes it did lead to deadlock in ntfs so I
had to work around it in ntfs to get it to work. I had to redesign how
the locking worked which was a really annoying thing to have to do. )-:
Just pointing this out as it seems to be commonplace for nfs and nothing
new...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: NFS4 crack
2005-09-19 10:35 ` Christoph Hellwig
2005-09-19 13:04 ` Anton Altaparmakov
@ 2005-09-19 13:35 ` J. Bruce Fields
2005-09-19 13:39 ` Christoph Hellwig
2005-09-19 20:31 ` J. Bruce Fields
2 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-19 13:35 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> Namespaces issues above was meant as kernel can't assume namespace at
> all, not even thinking about multiple namespaces which makes it even
> more wrong. Who says I allow the kernel to mess with
> /var/lib/nfs/v4recover?
It's run-time configurable if you don't like the default.
> Who tells any userspace process is even in the same namespace as the
> nfs threads to create the directories?
No userspace process is likely to care, except maybe for debugging
purposes. This isn't a userspace<->kernel interface, it's just a way to
store some information on disk so nfsd can find it again on next boot.
> Kernel assuming any namespace is wrong and we don't do it anywhere.
Well, nfsd does have some assumptions--mountd, exportfs, and nfsd all
have to be in the same namespace, for example. (Or at least namespaces
that are identical on exported paths.)
> > > The fs handling in fs/nfsd/nfs4recovery.c is rather broken in addition.
> >
> > For example?
>
> - opens a directory O_RDWR which open_namei wouldn't even allow
> - tries to build dentry list from vfs_readdir callback, leading to
> deadlocks on filesystems that take the same lock from readdir
> and lookup
> - resets fsuid/fsgids without checks, synchronization or callouts
> into subsystems that care (security, keys, ptrace)
> - looks up /var/lib/nfs/v4recovery without ensuring it's a directory
>
> and probably a few more if one tried to look at it for more than five
> minutes.
Are you sure about readdir? It looks to me like nfsd has done lookups
there for some time--see, e.g., fs/nfsd/nfs3xdr.c:compose_entry_fh().
But I'll read through it again and check the other stuff you mention,
thanks.
--b.
* Re: NFS4 crack
2005-09-19 13:35 ` J. Bruce Fields
@ 2005-09-19 13:39 ` Christoph Hellwig
2005-09-19 14:07 ` J. Bruce Fields
2005-09-19 17:13 ` Bryan Henderson
0 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 13:39 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 09:35:28AM -0400, J. Bruce Fields wrote:
> On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> > Namespaces issues above was meant as kernel can't assume namespace at
> > all, not even thinking about multiple namespaces which makes it even
> > more wrong. Who says I allow the kernel to mess with
> > /var/lib/nfs/v4recover?
>
> It's run-time configurable if you don't like the default.
>
> > Who tells any userspace process is even in the same namespace as the
> > nfs threads to create the directories?
>
> No userspace process is likely to care, except maybe for debugging
> purposes. This isn't a userspace<->kernel interface, it's just a way to
> store some information on disk so nfsd can find it again on next boot.
Again,
FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
if that wasn't clear enough. You can continue enumerating the special
cases in which it could actually work somehow, but that doesn't make the
code any better. We have a strong policy to not have hardcoded
filenames in the kernel (although there are a few weak abstractions where
we pass something very similar to a filename up to userspace to act on it),
and we're not going to make an exception for NFSv4. Especially as this
code would be much simpler in userspace as already mentioned. Directory
handling is something that can't be done sanely in kernelspace.
* Re: NFS4 crack
2005-09-19 13:39 ` Christoph Hellwig
@ 2005-09-19 14:07 ` J. Bruce Fields
2005-09-19 14:11 ` Christoph Hellwig
2005-09-19 17:13 ` Bryan Henderson
1 sibling, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-19 14:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 03:39:21PM +0200, Christoph Hellwig wrote:
> On Mon, Sep 19, 2005 at 09:35:28AM -0400, J. Bruce Fields wrote:
> > On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> > > Namespaces issues above was meant as kernel can't assume namespace at
> > > all, not even thinking about multiple namespaces which makes it even
> > > more wrong. Who says I allow the kernel to mess with
> > > /var/lib/nfs/v4recover?
> >
> > It's run-time configurable if you don't like the default.
> >
> > > Who tells any userspace process is even in the same namespace as the
> > > nfs threads to create the directories?
> >
> > No userspace process is likely to care, except maybe for debugging
> > purposes. This isn't a userspace<->kernel interface, it's just a way to
> > store some information on disk so nfsd can find it again on next boot.
>
> Again,
>
> FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
What problem does this create in this case?
The "hardcoded" path is just a default for a value that can be modified
at runtime. We could default to the empty string, I suppose, and make
sure the path is set in the nfs init scripts.
--b.
* Re: NFS4 crack
2005-09-19 14:07 ` J. Bruce Fields
@ 2005-09-19 14:11 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 14:11 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 10:07:15AM -0400, J. Bruce Fields wrote:
> > > No userspace process is likely to care, except maybe for debugging
> > > purposes. This isn't a userspace<->kernel interface, it's just a way to
> > > store some information on disk so nfsd can find it again on next boot.
> >
> > Again,
> >
> > FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
>
> What problem does this create in this case?
>
> The "hardcoded" path is just a default for a value that can be modified
> at runtime.
Umm, that's not the point at all. Pathnames are user policy and they
shouldn't be used from the kernel, even if configurable. File access from
kernelspace should be avoided whenever possible. NFSD is an exception as
it needs to access files as part of its job, but that exception doesn't
give it a wildcard to do random crap.
And the other point is that the code is utter crap and could be done
much better in userspace.
* Re: NFS4 crack
2005-09-19 13:39 ` Christoph Hellwig
2005-09-19 14:07 ` J. Bruce Fields
@ 2005-09-19 17:13 ` Bryan Henderson
2005-09-19 17:16 ` Randy.Dunlap
2005-09-19 18:02 ` Christoph Hellwig
1 sibling, 2 replies; 41+ messages in thread
From: Bryan Henderson @ 2005-09-19 17:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel,
neilb
>FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
I think that's a great policy, but we can't be all that righteous about it
because we don't do it today. I have a system that has highly customized
file names, so I'm pretty familiar with all the world's hardcoded file
names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
/sbin/modprobe. I could give /sbin/init and /bin/sh a pass because
they're involved in bootstrapping, which always breaks a few rules.
/sbin/modprobe is part of an application that is in the same boat as
NFSv4: an application that was born to be user space but after
considering the tradeoffs well, people decided to put them in the kernel
anyway. When you do that, it shouldn't be too surprising that people drag
some user space things like opening files by name with it.
In this case, though, there's an easy enough fix: something in user space
opens /var/lib/nfs/v4recover and passes the file handle to the kernel in a
server configuration step. This would be like what process accounting and
disk quota do.
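
A minimal userspace sketch of that idea, assuming the daemon side is plain C:
open the recovery directory once, insist that it really is a directory and
not a symlink, and keep the descriptor for handing to the kernel. The name
open_recovery_dir is hypothetical, and the step that actually passes the
descriptor to nfsd is left out because no such interface is described in
this thread.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the recovery directory by name in userspace.
 * O_DIRECTORY makes open() fail if the path is not a directory, and
 * O_NOFOLLOW makes it fail if the last path component is a symlink.
 * Returns a descriptor on success, -1 on failure.
 */
int open_recovery_dir(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);

    if (fd < 0)
        return -1;

    /* Belt and braces: confirm the type of the object we actually opened. */
    if (fstat(fd, &st) < 0 || !S_ISDIR(st.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this shape the path is resolved exactly once, in the namespace of the
process that does the configuration, and everything afterwards works on the
descriptor rather than on the name.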
That addresses the use of file names in the kernel; it's not to say there
aren't other problems with the present approach.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: NFS4 crack
2005-09-19 17:13 ` Bryan Henderson
@ 2005-09-19 17:16 ` Randy.Dunlap
2005-09-19 21:57 ` Bryan Henderson
2005-09-19 18:02 ` Christoph Hellwig
1 sibling, 1 reply; 41+ messages in thread
From: Randy.Dunlap @ 2005-09-19 17:16 UTC (permalink / raw)
To: Bryan Henderson
Cc: Christoph Hellwig, akpm, andros, J. Bruce Fields, linux-fsdevel,
neilb
On Mon, 19 Sep 2005, Bryan Henderson wrote:
> >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
>
> I think that's a great policy, but we can't be all that righteous about it
> because we don't do it today. I have a system that has highly customized
> file names, so I'm pretty familiar with all the world's hardcoded file
> names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> /sbin/modprobe. I could give /sbin/init and /bin/sh a pass because
agreed.
> they're involved in bootstrapping, which always breaks a few rules.
> /sbin/modprobe is part of an application that is in the same boat as
> NFSv4: an application that was born to be user space but after
> considering the tradeoffs well, people decided to put them in the kernel
> anyway. When you do that, it shouldn't be too surprising that people drag
> some user space things like opening files by name with it.
modprobe executable filename comes from here:
rddunlap@vortex:/proc/sys/kernel> cat modprobe
/sbin/modprobe
> In this case, though, there's an easy enough fix: something in user space
> opens /var/lib/nfs/v4recover and passes the file handle to the kernel in a
> server configuration step. This would be like what process accounting and
> disk quota do.
>
> That addresses the use of file names in the kernel; it's not to say there
> aren't other problems with the present approach.
--
~Randy
* Re: NFS4 crack
2005-09-19 17:13 ` Bryan Henderson
2005-09-19 17:16 ` Randy.Dunlap
@ 2005-09-19 18:02 ` Christoph Hellwig
2005-09-19 18:53 ` William A.(Andy) Adamson
2005-09-19 19:01 ` J. Bruce Fields
1 sibling, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 18:02 UTC (permalink / raw)
To: Bryan Henderson
Cc: Christoph Hellwig, akpm, andros, J. Bruce Fields, linux-fsdevel,
neilb
On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote:
> >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
>
> I think that's a great policy, but we can't be all that righteous about it
> because we don't do it today. I have a system that has highly customized
> file names, so I'm pretty familiar with all the world's hardcoded file
> names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> /sbin/modprobe.
They are not nice, but quite a bit different, as we are trying to execute
them, which can't have bad side-effects in case they don't exist.
What nfsd does is expecting a directory to be present on which it can
do various operations. That's much worse than trying to execute or
even read from a file. Besides that all this directory handling really
belongs into userland as pointed out _three times_ now.
* Re: NFS4 crack
2005-09-19 18:02 ` Christoph Hellwig
@ 2005-09-19 18:53 ` William A.(Andy) Adamson
2005-09-19 18:59 ` Christoph Hellwig
2005-09-19 22:04 ` Bryan Henderson
2005-09-19 19:01 ` J. Bruce Fields
1 sibling, 2 replies; 41+ messages in thread
From: William A.(Andy) Adamson @ 2005-09-19 18:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bryan Henderson, Christoph Hellwig, akpm, andros, J. Bruce Fields,
linux-fsdevel, neilb
> On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote:
> > >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
> >
> > I think that's a great policy, but we can't be all that righteous about it
> > because we don't do it today. I have a system that has highly customized
> > file names, so I'm pretty familiar with all the world's hardcoded file
> > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> > /sbin/modprobe.
>
> They are not nice, but quite a bit different, as we are trying to execute
> them, which can't have bad side-effects in case they don't exist.
>
> What nfsd does is expecting a directory to be present on which it can
> do various operations. That's much worse than trying to execute or
> even read from a file.
what we could do is not provide a default, and turn off reboot recovery (no
grace period) if the recovery directory is not configured.
> Besides that all this directory handling really
> belongs into userland as pointed out _three times_ now.
We were anticipating placing data into files in the recovery directory at each
OPEN and each LOCK call in order to limit the scope of the NFSv4 grace period
to the state that was actually in use prior to the reboot. We therefore went
ahead with a kernel implementation for performance reasons.
-->Andy
* Re: NFS4 crack
2005-09-19 18:53 ` William A.(Andy) Adamson
@ 2005-09-19 18:59 ` Christoph Hellwig
2005-09-19 22:04 ` Bryan Henderson
1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 18:59 UTC (permalink / raw)
To: William A.(Andy) Adamson
Cc: Christoph Hellwig, Bryan Henderson, Christoph Hellwig, akpm,
J. Bruce Fields, linux-fsdevel, neilb
On Mon, Sep 19, 2005 at 02:53:36PM -0400, William A.(Andy) Adamson wrote:
> > They are not nice, but quite a bit different, as we are trying to execute
> > them, which can't have bad side-effects in case they don't exist.
> >
> > What nfsd does is expecting a directory to be present on which it can
> > do various operations. That's much worse than trying to execute or
> > even read from a file.
>
> what we could do is not provide a default, and turn off reboot recovery (no
> grace period) if the recovery directory is not configured.
>
> > Besides that all this directory handling really
> > belongs into userland as pointed out _three times_ now.
>
> We were anticipating placing data into files in the recovery directory at each
> OPEN and each LOCK call in order to limit the scope of the NFSv4 grace period
> to the state that was actually in use prior to the reboot. We therefore went
> ahead with a kernel implementation for performance reasons.
Then pass in a file descriptor for each client. Doing all these
directory operations is not an option - if you need to do actual file I/O
to them that's less of a problem.
And please discuss such design issues here on -fsdevel.
* Re: NFS4 crack
2005-09-19 18:02 ` Christoph Hellwig
2005-09-19 18:53 ` William A.(Andy) Adamson
@ 2005-09-19 19:01 ` J. Bruce Fields
2005-09-19 19:05 ` Christoph Hellwig
1 sibling, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-19 19:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bryan Henderson, Christoph Hellwig, akpm, andros, linux-fsdevel,
neilb
On Mon, Sep 19, 2005 at 07:02:40PM +0100, Christoph Hellwig wrote:
> On Mon, Sep 19, 2005 at 10:13:49AM -0700, Bryan Henderson wrote:
> > >FILENAMES ARE POLICY AND HAVE NO BUSINESS IN THE KERNEL
> >
> > I think that's a great policy, but we can't be all that righteous about it
> > because we don't do it today. I have a system that has highly customized
> > file names, so I'm pretty familiar with all the world's hardcoded file
> > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> > /sbin/modprobe.
>
> They are not nice, but quite a bit different, as we are trying to execute
> them, which can't have bad side-effects in case they don't exist.
What bad side-effects are you thinking of here?
--b.
* Re: NFS4 crack
2005-09-19 19:01 ` J. Bruce Fields
@ 2005-09-19 19:05 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2005-09-19 19:05 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Christoph Hellwig, Bryan Henderson, Christoph Hellwig, akpm,
andros, linux-fsdevel, neilb
On Mon, Sep 19, 2005 at 03:01:17PM -0400, J. Bruce Fields wrote:
> > > because we don't do it today. I have a system that has highly customized
> > > file names, so I'm pretty familiar with all the world's hardcoded file
> > > names. ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> > > /sbin/modprobe.
> >
> > They are not nice, but quite a bit different, as we are trying to execute
> > them, which can't have bad side-effects in case they don't exist.
>
> What bad side-effects are you thinking of here?
Sorry s/don't exist/& as expected/
think of your directory as a symlink to something important, you'll just
mess with it, confuse nfsd, wipe parts out. All kinds of nasty things
can happen.
* Re: NFS4 crack
2005-09-19 10:35 ` Christoph Hellwig
2005-09-19 13:04 ` Anton Altaparmakov
2005-09-19 13:35 ` J. Bruce Fields
@ 2005-09-19 20:31 ` J. Bruce Fields
2005-09-20 12:49 ` Greg KH
2 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-19 20:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 12:35:47PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 18, 2005 at 10:36:15AM -0400, J. Bruce Fields wrote:
> > On Sun, Sep 18, 2005 at 12:21:00PM +0200, Christoph Hellwig wrote:
> > > The fs handling in fs/nfsd/nfs4recovery.c is rather broken in addition.
> >
> > For example?
>
> - opens a directory O_RDWR which open_namei wouldn't even allow
OK, thanks, fixed locally.
> - tries to build dentry list from vfs_readdir callback, leading to
> deadlocks on filesystems that take the same lock from readdir
> and lookup
So it appears that nfsd has long made the requirement that filesystems
not do this. Does this need to be documented somewhere?
> - resets fsuid/fsgids without checks, synchronization or callouts
> into subsystems that care (security, keys, ptrace)
I think the model here was nfsd_setuser(), which does essentially the
same thing. Is this an nfsd bug?
> - looks up /var/lib/nfs/v4recovery without ensuring it's a directory
Oops, thanks.
> and probably a few more if one tried to look at it for more than five
> minutes. This is code that could be a third of the size if written
> in userspace and actually had a chance to be correct there, never mind
> the policy violations.
That's a couple good bugs identified, thanks, but I'm not convinced that
this would be significantly simpler from userspace.
We'd need two pieces of user<->kernel interface:
1. An upcall to userspace to tell it about new client state. We
also need to be able to wait for userspace to commit something
to disk, as the information has to survive a reboot.
2. A way for userspace to dump recorded state to the kernel the
next time nfsd starts up.
Number 1 could be done with something like hotplug, I guess. (It can be
told to wait for the userspace helper to exit, right?)
Another file in the nfsd filesystem might work for the second interface.
We also considered accomplishing number 1 by appending records to a log
file. Userspace could hand in a file descriptor to use for this
purpose. We'd still need the second interface.
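
The log-file variant could look roughly like this from the writer's side.
This is only a sketch under assumptions not stated in the thread: one
newline-terminated record per client, a single writer, and a descriptor that
was opened elsewhere and handed in. It is written as ordinary userspace C for
brevity, and log_client_record and write_all are hypothetical names. The part
it is meant to illustrate is the fsync() before returning, since the record
has to be on disk before anyone relies on it surviving a reboot.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and on EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical record format: the client identifier followed by '\n'.
 * Assumes a single writer, so the two writes cannot interleave with
 * another record. Returns 0 only after fsync() reports success.
 */
int log_client_record(int log_fd, const char *client_id)
{
    if (write_all(log_fd, client_id, strlen(client_id)) < 0)
        return -1;
    if (write_all(log_fd, "\n", 1) < 0)
        return -1;
    return fsync(log_fd);
}
```

Reading the records back at startup (the second interface) would be a
separate step and is not shown here.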
--b.
* Re: NFS4 crack
2005-09-19 17:16 ` Randy.Dunlap
@ 2005-09-19 21:57 ` Bryan Henderson
2005-09-19 22:11 ` Randy.Dunlap
0 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-19 21:57 UTC (permalink / raw)
To: Randy.Dunlap
Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel,
neilb
>>ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
>> /sbin/modprobe.
>modprobe executable filename comes from here:
>rddunlap@vortex:/proc/sys/kernel> cat modprobe
>/sbin/modprobe
Did you mean this as a contradiction? Because it isn't. /sbin/modprobe
is hardcoded in the kernel as the default name of the module loader
program. More importantly, even if you set the module loader program name
externally, the kernel is still accessing that file by name, and that's
less than desirable.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: NFS4 crack
2005-09-19 18:53 ` William A.(Andy) Adamson
2005-09-19 18:59 ` Christoph Hellwig
@ 2005-09-19 22:04 ` Bryan Henderson
1 sibling, 0 replies; 41+ messages in thread
From: Bryan Henderson @ 2005-09-19 22:04 UTC (permalink / raw)
To: William A.(Andy) Adamson
Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig,
Christoph Hellwig, linux-fsdevel, neilb
>what we could do is not provide a default, and turn off reboot recovery
>(no grace period) if the recovery directory is not configured.
It sounds like you're still talking about configuring a file name into the
kernel -- just doing it at run time instead of build time. While better,
I'd really rather not see the kernel access files by names. Giving the
kernel file descriptors would be better.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: NFS4 crack
2005-09-19 21:57 ` Bryan Henderson
@ 2005-09-19 22:11 ` Randy.Dunlap
2005-09-20 0:17 ` Bryan Henderson
0 siblings, 1 reply; 41+ messages in thread
From: Randy.Dunlap @ 2005-09-19 22:11 UTC (permalink / raw)
To: Bryan Henderson
Cc: Randy.Dunlap, akpm, andros, J. Bruce Fields, Christoph Hellwig,
linux-fsdevel, neilb
On Mon, 19 Sep 2005, Bryan Henderson wrote:
> >>ISTR the Linux kernel hardcodes /sbin/init, /bin/sh, and
> >> /sbin/modprobe.
> >modprobe executable filename comes from here:
> >rddunlap@vortex:/proc/sys/kernel> cat modprobe
> >/sbin/modprobe
>
> Did you mean this as a contradiction? Because it isn't. /sbin/modprobe
> is hardcoded in the kernel as the default name of the module loader
> program. More importantly, even if you set the module loader program name
> externally, the kernel is still accessing that file by name, and that's
> less than desirable.
Yes, there's a hard-coded default value for the module loader.
That doesn't sound bad to me.
So if I choose to use /sbin/bhloader (i.e., I set
/proc/sys/kernel/modprobe to "/sbin/bhloader"),
what's the problem? How should the kernel access that file?
And the kernel doesn't really access that file per se.
It just calls call_usermodehelper() to start a task and modprobe_path
is one of the parameters there.
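
For illustration, a small userspace sketch that reads such a one-line setting
back. The name read_helper_path is hypothetical; the file argument would be
/proc/sys/kernel/modprobe on a system that provides it, and the sketch makes
no claim about how the kernel itself stores or uses the value.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of a one-line settings file
 * (for example /proc/sys/kernel/modprobe) into buf, without the newline.
 * Returns 0 on success, -1 if the buffer size is unusable, the file
 * cannot be opened, or nothing could be read from it.
 */
int read_helper_path(const char *file, char *buf, size_t len)
{
    FILE *f;

    if (len == 0 || len > INT_MAX)
        return -1;

    f = fopen(file, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Cut the string at the first newline, if there is one. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Called as read_helper_path("/proc/sys/kernel/modprobe", buf, sizeof(buf)), it
should fill buf with the same string that the cat output quoted earlier shows.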
Thanks,
--
~Randy
* Re: NFS4 crack
2005-09-19 22:11 ` Randy.Dunlap
@ 2005-09-20 0:17 ` Bryan Henderson
0 siblings, 0 replies; 41+ messages in thread
From: Bryan Henderson @ 2005-09-20 0:17 UTC (permalink / raw)
To: Randy.Dunlap
Cc: akpm, andros, J. Bruce Fields, Christoph Hellwig, linux-fsdevel,
neilb, Randy.Dunlap
>Yes, there's a hard-coded default value for the module loader.
>That doesn't sound bad to me.
>
>So if I choose to use /sbin/bhloader (i.e., I set
>/proc/sys/kernel/modprobe to "/sbin/bhloader"),
>what's the problem? How should the kernel access that file?
Remember that the main point of this subthread is that the situation
complained of in the NFSv4 kernel code already exists, with the question
of whether that is an OK situation considered separately. The NFSv4 code
also has a hardcoded default file name (well, filesystem object name
anyway) that can be overridden by the user, but the kernel code identifies
the filesystem object by name when it's time to use it, in any case.
Christoph points out that the practical ramifications are less with
/sbin/modprobe because it's an executable file and tends to exist always,
but at a more basic level, the two are analogous.
But I do find the situation objectionable (in both cases), because I
prefer layering. I prefer that the guts of the kernel know nothing about
the file name space, which means "/sbin/modprobe" can't be special in any
way, and the kernel can't request any service by file name.
If it were up to me, the kernel would inform a user space process that a
module needs to be loaded and it would be up to that process to decide
from what file to get the loader program (maybe based on a config file in
/etc). The kernel would never know the name of that file. I believe it
used to be that way, so apparently someone else had different priorities.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: NFS4 crack
2005-09-19 20:31 ` J. Bruce Fields
@ 2005-09-20 12:49 ` Greg KH
2005-09-20 15:10 ` William A.(Andy) Adamson
0 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2005-09-20 12:49 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Christoph Hellwig, akpm, neilb, andros, linux-fsdevel
On Mon, Sep 19, 2005 at 04:31:43PM -0400, J. Bruce Fields wrote:
> We'd need two pieces of user<->kernel interface:
>
> 1. An upcall to userspace to tell it about new client state. We
> also need to be able to wait for userspace to commit something
> to disk, as the information has to survive a reboot.
> 2. A way for userspace to dump recorded state to the kernel the
> next time nfsd starts up.
>
> Number 1 could be done with something like hotplug, I guess. (It can be
> told to wait for the userspace helper to exit, right?)
Well, calling /sbin/hotplug itself can't be told to wait, especially as
that value is being set to NULL by most distros these days, as they are
using netlink instead.
Good luck,
greg k-h
* Re: NFS4 crack
2005-09-20 12:49 ` Greg KH
@ 2005-09-20 15:10 ` William A.(Andy) Adamson
0 siblings, 0 replies; 41+ messages in thread
From: William A.(Andy) Adamson @ 2005-09-20 15:10 UTC (permalink / raw)
To: Greg KH
Cc: J. Bruce Fields, Christoph Hellwig, akpm, neilb, andros,
linux-fsdevel, andros
> On Mon, Sep 19, 2005 at 04:31:43PM -0400, J. Bruce Fields wrote:
> > We'd need two pieces of user<->kernel interface:
> >
> > 1. An upcall to userspace to tell it about new client state. We
> > also need to be able to wait for userspace to commit something
> > to disk, as the information has to survive a reboot.
> > 2. A way for userspace to dump recorded state to the kernel the
> > next time nfsd starts up.
> >
> > Number 1 could be done with something like hotplug, I guess. (It can be
> > told to wait for the userspace helper to exit, right?)
>
> Well, calling /sbin/hotplug itself can't be told to wait, especially as
> that value is being set to NULL by most distros these days, as they are
> using netlink instead.
>
call_usermodehelper_keys() with the wait status is what we are thinking of
using for #1.
note that the keyring code which uses call_usermodehelper_keys also hard codes
an executable name.
security/keys/request_key.c:
/* set up the argument list */
i = 0;
argv[i++] = "/sbin/request-key";
argv[i++] = (char *) op;
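
Whatever mechanism launches it, the helper for #1 must not exit until the
client record is on stable storage, since the kernel would be waiting on
that exit. A minimal userspace sketch of that commit step, assuming a
hypothetical function record_client() and a recovery directory supplied by
the caller (neither is taken from an actual nfsd helper):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: durably record one client identifier as a file in
 * dir_path. Returns 0 once the data and the directory entry have been
 * flushed, -1 on any error. The caller must ensure client contains no '/'. */
int record_client(const char *dir_path, const char *client)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir_path, client);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(client);
    if (write(fd, client, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Flush the directory too, so the new entry itself survives a crash. */
    int dfd = open(dir_path, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;
}
```

A helper built this way would exit with status 0 only after record_client()
returns 0, which is what the waiting kernel side would key on.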
-->Andy
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-18 10:21 NFS4 crack Christoph Hellwig
2005-09-18 14:36 ` J. Bruce Fields
@ 2005-09-20 18:37 ` Neil Brown
2005-09-21 7:44 ` Andrew Morton
` (3 more replies)
1 sibling, 4 replies; 41+ messages in thread
From: Neil Brown @ 2005-09-20 18:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, andros, bfields, linux-fsdevel, Olaf Kirch
On Sunday September 18, hch@lst.de wrote:
> I've recently turned on NFS4 server support accidentally, just to get
> error messages like:
>
> "NFSD: recovery directory /var/lib/nfs/v4recovery doesn't exist"
>
> To my horror I found out that this comes from kernel code, which messes
> with a hardcoded directory, completely ignoring any namespace or other
> uses issues. The fs handling in fs/nfs/nfs4recovery.c is rather broken
> in addition.
>
> All this comes from "[PATCH] knfsd: nfsd4: initialize recovery directory",
> commit ID 190e4fbf96037e5e526ba3210f2bcc2a3b6fe964.
I confess that I am having trouble finding a convincing basis for your
position, which is why I allowed the patch through in the first place
(despite not particularly liking it).
My problem is: where do you draw the line?
It should be noted first that nfsd is unlike most (all?) other kernel
code. It is an application that is running in-kernel. It is a
consumer of kernel services, and provides no (significant) services to
user-space, or to other parts of the kernel.
Now, this in-kernel-application needs to store stable
application-specific data somewhere. May it:
1/ open a directory and create files in it and write to them
2/ open a directory and create files provided that the name of the
directory is given by userspace
3/ create files in a directory that was created by userspace and
given to the knfsd application as a filedescriptor
4/ write data to files which were created and opened by user-space
based on filenames provided by knfsd (hostnames or equivalents in
this case).
5/ pass the data to userspace and let it worry completely.
6/ sorry, you cannot have application-specific state.
I'm sure you will see a progression here. I ask again: "where do you
draw the line?" You seem to rule out 1, and probably 2, and possibly
3 based on other comments in the thread. I cannot see a rational
place to draw the line other than before-1 or after-4. i.e. if you
allow 4, you may as well allow 1 too.
If you can give a clear argument for some particular place to draw
the line, I'd love to hear it, together with your justification.
While considering it, you might also like to consider:
- is it ok for knfsd to bind to port 2049 ?
- is it ok if userspace tells it the number '2049' ?
- does user-space have to create/bind the socket and pass it to
knfsd?
- does user-space have to receive the packets and pass them to knfsd?
(ok, that one is really silly).
and "why?"
The reality is that NFS service is an application. Currently parts of
it are in-kernel (nfsd, lockd) and parts are in user-space (portmap,
statd(*), mountd).
There are two positions on what-goes-where that make sense to me:
1- pragmatism: put code where it works best. I believe that the
current code fits pragmatism quite well (modulo bugs).
2- "rightness": If you want to argue from a what-belongs-where
perspective, you have to say that knfsd doesn't belong in the
kernel at all. The kernel should just supply the core services
(e.g. file-handle <-> fd mapping) and let userspace do the rest.
Were I starting to write knfsd today, I would pick 2. Given where we
actually are today, I pick 1.
> we really need someone sane review NFS patches I think.
yes please.. pretty please :-)
>
> (not cc'ed to the nfs list because of its stupid subscribers only
> policy)
Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
are like that, and nfs-devel@linux.kernel.org died long ago. :-(
NeilBrown
(*) There are patches in existence which move statd implementation
into the kernel. The final conclusion here may well affect those
patches, so I hope Olaf has been listening in....
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-20 18:37 ` Neil Brown
@ 2005-09-21 7:44 ` Andrew Morton
2005-09-22 20:58 ` William A.(Andy) Adamson
2005-09-21 13:41 ` Trond Myklebust
` (2 subsequent siblings)
3 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2005-09-21 7:44 UTC (permalink / raw)
To: Neil Brown; +Cc: hch, andros, bfields, linux-fsdevel, okir
Neil Brown <neilb@suse.de> wrote:
>
> Now, this in-kernel-application needs to store stable
> application-specific data somewhere. May it:
>
> 1/ open a directory and create files in it and write to them
> 2/ open a directory and create files provided that the name of the
> directory is given by userspace
> 3/ create files in a directory that was created by userspace and
> given to the knfsd application as a filedescriptor
> 4/ write data to files which were created and opened by user-space
> based on filenames provided by knfsd (hostnames or equivalents in
> this case).
> 5/ pass the data to userspace and let it worry completely.
> 6/ sorry, you cannot have application-specific state.
>
5/ sounds good. There are numerous options, newly including connector and
configfs.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-20 18:37 ` Neil Brown
2005-09-21 7:44 ` Andrew Morton
@ 2005-09-21 13:41 ` Trond Myklebust
2005-09-21 14:40 ` J. Bruce Fields
2005-09-22 16:28 ` Bryan Henderson
3 siblings, 0 replies; 41+ messages in thread
From: Trond Myklebust @ 2005-09-21 13:41 UTC (permalink / raw)
To: Neil Brown
Cc: Christoph Hellwig, akpm, andros, bfields, linux-fsdevel,
Olaf Kirch
On Wed, 21.09.2005 at 04:37 (+1000), Neil Brown wrote:
> Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
> are like that, and nfs-devel@linux.kernel.org died long ago. :-(
I can set up an unmoderated NFS list on linux-nfs.org if there is a
demand for it.
I could also open up nfsv4@linux-nfs.org if that is desirable.
However my preference would be to see the admins for
nfs@lists.sourceforge.net (whoever the hell is in that select list these
days) open that list up. There should be no need to keep duplicating all
these mailing lists, and nfs@lists is currently supposed to be the
generic NFS list.
Cheers,
Trond
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-20 18:37 ` Neil Brown
2005-09-21 7:44 ` Andrew Morton
2005-09-21 13:41 ` Trond Myklebust
@ 2005-09-21 14:40 ` J. Bruce Fields
2005-09-22 16:28 ` Bryan Henderson
3 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2005-09-21 14:40 UTC (permalink / raw)
To: Neil Brown; +Cc: Christoph Hellwig, akpm, andros, linux-fsdevel, Olaf Kirch
On Wed, Sep 21, 2005 at 04:37:36AM +1000, Neil Brown wrote:
> On Sunday September 18, hch@lst.de wrote:
> > (not cc'ed to the nfs list because of its stupid subscribers only
> > policy)
>
> Sad, isn't it. Both nfs@lists.sourceforge.net and nfsv4@linux-nfs.org
> are like that, and nfs-devel@linux.kernel.org died long ago. :-(
The nfsv4@linux-nfs.org policy is to defer non-subscriber email for
moderation. There are a couple moderators, and we should usually be
able to moderate (and whitelist) anyone within a few hours. But we
could open it up more.
It'd also be nice to open up the sourceforge list some more--I think it
has the same sort of policy but the delays occasionally seem to be
measured in weeks.
--b.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-20 18:37 ` Neil Brown
` (2 preceding siblings ...)
2005-09-21 14:40 ` J. Bruce Fields
@ 2005-09-22 16:28 ` Bryan Henderson
2005-09-22 16:52 ` Trond Myklebust
3 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 16:28 UTC (permalink / raw)
To: Neil Brown
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Olaf Kirch
>2- "rightness": If you want to argue from a what-belongs-where
> perspective, you have to say that knfsd doesn't belong in the
> kernel at all. The kernel should just supply the core services
> (e.g. file-handle <-> fd mapping) and let userspace do the rest.
This is the real reason that it is so hard to draw that line, and why
fairly natural code in knfsd makes kernel programmers recoil in horror.
Maybe you could remind everyone why knfsd is in the kernel. If it's just
speed, what if anything would have to change in the structure of a system
to make it work as fast in user space?
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 16:28 ` Bryan Henderson
@ 2005-09-22 16:52 ` Trond Myklebust
2005-09-22 17:38 ` Peter Staubach
0 siblings, 1 reply; 41+ messages in thread
From: Trond Myklebust @ 2005-09-22 16:52 UTC (permalink / raw)
To: Bryan Henderson
Cc: Neil Brown, akpm, andros, bfields, Christoph Hellwig,
linux-fsdevel, Olaf Kirch
On Thu, 22.09.2005 at 09:28 (-0700), Bryan Henderson wrote:
> Maybe you could remind everyone why knfsd is in the kernel. If it's just
> speed, what if anything would have to change in the structure of a system
> to make it work as fast in user space?
The main reason for keeping (part of) the NFS server in the kernel is
not speed, but coping with races.
In particular note that all NFS operations on files take an opaque
filehandle argument rather than a path. For instance, the operation
CREATE takes a filehandle argument in order to determine the path of the
directory in which to create the file, then a string argument to
determine the filename.
The set of filesystem-supplied helper functions that convert a
filehandle into a dentry means that knfsd can do this safely without
danger of racing with rename() calls, unlink(),...
Trying to do the same thing in userland would have to involve first
converting the filehandle into a pathname, and then calling a POSIX
function using that pathname which is obviously very race prone.
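
The race can be illustrated from userland. The following is only a sketch,
not knfsd code: it assumes a system offering descriptor-relative calls such
as openat(2) and unlinkat(2) (standardized in POSIX.1-2008), and the
function name rename_race_demo() and the file names are made up for the
illustration. An open directory descriptor stands in, loosely, for the
dentry that knfsd obtains from a filehandle.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if, after the directory is renamed, the path-based unlink fails
 * with ENOENT while the descriptor-relative unlink succeeds; -1 otherwise. */
int rename_race_demo(const char *base)
{
    char old_dir[4096], new_dir[4096], old_file[4200];
    long pid = (long)getpid();

    snprintf(old_dir, sizeof(old_dir), "%s/race_%ld_old", base, pid);
    snprintf(new_dir, sizeof(new_dir), "%s/race_%ld_new", base, pid);
    snprintf(old_file, sizeof(old_file), "%s/victim", old_dir);

    if (mkdir(old_dir, 0700) != 0)
        return -1;
    int dfd = open(old_dir, O_RDONLY);
    if (dfd < 0) {
        rmdir(old_dir);
        return -1;
    }
    int fd = openat(dfd, "victim", O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        close(dfd);
        rmdir(old_dir);
        return -1;
    }
    close(fd);

    /* Someone renames the directory between "lookup" and "use". */
    if (rename(old_dir, new_dir) != 0) {
        unlinkat(dfd, "victim", 0);
        close(dfd);
        rmdir(old_dir);
        return -1;
    }

    /* Path-based operation: the remembered pathname is now stale. */
    int path_rc = unlink(old_file);
    int path_errno = errno;

    /* Descriptor-relative operation: still reaches the same directory. */
    int at_rc = unlinkat(dfd, "victim", 0);

    close(dfd);
    rmdir(new_dir);
    return (path_rc == -1 && path_errno == ENOENT && at_rc == 0) ? 0 : -1;
}
```

Failing with ENOENT is the benign outcome; if another directory had been
created under the old name in the meantime, the path-based call would have
silently acted on the wrong object.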
Cheers,
Trond
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 16:52 ` Trond Myklebust
@ 2005-09-22 17:38 ` Peter Staubach
2005-09-22 17:52 ` Trond Myklebust
2005-09-22 21:19 ` Bryan Henderson
0 siblings, 2 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-22 17:38 UTC (permalink / raw)
To: Trond Myklebust
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields,
Christoph Hellwig, linux-fsdevel, Olaf Kirch
Trond Myklebust wrote:
>On Thu, 22.09.2005 at 09:28 (-0700), Bryan Henderson wrote:
>
>
>
>>Maybe you could remind everyone why knfsd is in the kernel. If it's just
>>speed, what if anything would have to change in the structure of a system
>>to make it work as fast in user space?
>>
>>
>
>The main reason for keeping (part of) the NFS server in the kernel is
>not speed, but coping with races.
>
>In particular note that all NFS operations on files take an opaque
>filehandle argument rather than a path. For instance, the operation
>CREATE takes a filehandle argument in order to determine the path of the
>directory in which to create the file, then a string argument to
>determine the filename.
>The set of filesystem-supplied helper functions that convert a
>filehandle into a dentry means that knfsd can do this safely without
>danger of racing with rename() calls, unlink(),...
>Trying to do the same thing in userland would have to involve first
>converting the filehandle into a pathname, and then calling a POSIX
>function using that pathname which is obviously very race prone.
>
It seems to me that a "system call" could be implemented which would allow
a file to be "opened" via the file handle.
But then, we would be back to the speed argument. Switching in and out
of the kernel requires time and data copies, both of which are not good
and would kill any possibilities of making the Linux NFS server competitive.
ps
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 17:38 ` Peter Staubach
@ 2005-09-22 17:52 ` Trond Myklebust
2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:19 ` Bryan Henderson
1 sibling, 1 reply; 41+ messages in thread
From: Trond Myklebust @ 2005-09-22 17:52 UTC (permalink / raw)
To: Peter Staubach
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields,
Christoph Hellwig, linux-fsdevel, Olaf Kirch
On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
> It seems to me that a "system call" could be implemented which would allow
> a file to be "opened" via the file handle.
Sure, but open alone isn't sufficient. A lot (most?) of the operations
involving filehandles are acting on directories.
Imagine if someone renames a directory on the server while the NFS
server is in the middle of an unlink() operation, for instance.
Cheers,
Trond
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 17:52 ` Trond Myklebust
@ 2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:08 ` Bryan Henderson
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-22 18:07 UTC (permalink / raw)
To: Trond Myklebust
Cc: Bryan Henderson, Neil Brown, akpm, andros, bfields,
Christoph Hellwig, linux-fsdevel, Olaf Kirch
Trond Myklebust wrote:
>On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
>
>
>>It seems to me that a "system call" could be implemented which would allow
>>a file to be "opened" via the file handle.
>>
>>
>
>Sure, but open alone isn't sufficient. A lot (most?) of the operations
>involving filehandles are acting on directories.
>
>Imagine if someone renames a directory on the server while the NFS
>server is in the middle of an unlink() operation, for instance.
>
Yup, although you could resolve that by introducing a whole set of
operations which work off of file descriptors, instead of pathnames.
Then, inside of the kernel, to do the real operation, the file
descriptor would get turned back into the inode, but without the
pathname lookup portion. Things like funlink(fd, name), fmkdir(fd, name),
frmdir(fd, name), etc. Other operating systems have implemented at
least a subset of these sorts of calls and it gets ugly quickly.
The NFS server also has to do its own special checking and sometimes
this checking conflicts with the checking done in the normal "from
user mode" path.
---
Without a great deal of work and many new interfaces, there is no way
to get something like the NFS server to run correctly outside of the
kernel address space. There are correctness issues such as Trond has
pointed out and there are performance issues as well.
Is there an inherent problem with the NFS server being implemented as an
alternate VFS layer in the kernel, with its own requirements? Or is
this an academic problem? Unless we are willing to consider moving to
a micro-kernel approach, ala Mach, then we are going to need to consider
the requirements of kernel based applications in addition to user level
applications.
Thanx...
ps
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-21 7:44 ` Andrew Morton
@ 2005-09-22 20:58 ` William A.(Andy) Adamson
0 siblings, 0 replies; 41+ messages in thread
From: William A.(Andy) Adamson @ 2005-09-22 20:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Neil Brown, hch, andros, bfields, linux-fsdevel, okir, andros
> Neil Brown <neilb@suse.de> wrote:
> >
> > Now, this in-kernel-application needs to store stable
> > application-specific data somewhere. May it:
> >
> > 1/ open a directory and create files in it and write to them
> > 2/ open a directory and create files provided that the name of the
> > directory is given by userspace
> > 3/ create files in a directory that was created by userspace and
> > given to the knfsd application as a filedescriptor
> > 4/ write data to files which were created and opened by user-space
> > based on filenames provided by knfsd (hostnames or equivalents in
> > this case).
> > 5/ pass the data to userspace and let it worry completely.
> > 6/ sorry, you cannot have application-specific state.
> >
>
> 5/ sounds good. There are numerous options, newly including connector and
> configfs.
alright. i'll look into a user space solution.
-->Andy
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 18:07 ` Peter Staubach
@ 2005-09-22 21:08 ` Bryan Henderson
2005-09-23 12:17 ` Peter Staubach
2005-09-22 21:48 ` NFS4 crack Nicholas Miell
2005-09-22 22:50 ` Greg Banks
2 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 21:08 UTC (permalink / raw)
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Trond Myklebust
>Yup, although you could resolve that by introducing a whole set of
>operations which work off of file descriptors, instead of pathnames.
To do the whole job, what you need is a set of system calls that work off
NFS file handles instead of path names, and you may even need a different
kind of open state, ergo file descriptor, and these system calls would
require special privilege.
And that's not so crazy -- Linux/Unix is long overdue for a more advanced
system call file interface than POSIX. NFS needs it; Windows
compatibility (e.g. Samba) needs it; backup, HSM, and storage management
need it. It just might not be practical in the near term.
It would be good to understand whether the NFS server is in the kernel for
basic structural reasons or just because we're too lazy to invent this new
system call interface, because that sheds light on how a normally user
space problem like storing persistent application data for NFSv4 should be
approached. Do we need a new kernel paradigm that admits file and
filename use within the kernel, or do we hold our nose and say, "what's
one more hack on top of an existing one?"
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 17:38 ` Peter Staubach
2005-09-22 17:52 ` Trond Myklebust
@ 2005-09-22 21:19 ` Bryan Henderson
1 sibling, 0 replies; 41+ messages in thread
From: Bryan Henderson @ 2005-09-22 21:19 UTC (permalink / raw)
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Trond Myklebust
>Switching in and out of the kernel requires time and data copies,
Does it? We've successfully eliminated copying with things like mmap,
direct I/O, and sendfile. And while the common wisdom says switching in
and out of kernel mode takes an eon, is that actually true on modern
systems? Switching into the kernel is fundamentally a trivial operation:
set a flag that says you're in privileged mode and load the instruction
address register to point to a trusted instruction. In the past, I've
seen systems that also switch address space when that happens, and have to
purge TLB and/or processor cache on an address space switch. That's a
significant slowdown. But do modern Linux systems suffer that way?
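
One way to put a number on the question is to time a trivial system call in
a loop. The sketch below is only a rough probe: syscall_ns() is a made-up
name, getppid is chosen as an assumption of "a call that does almost no
work", and the result says nothing about data-copy costs, only about the
round trip into the kernel and back.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getppid() system call over `iters`
 * calls, or a negative value if iters is not positive. */
double syscall_ns(long iters)
{
    struct timespec t0, t1;

    if (iters <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);   /* raw syscall, bypassing any libc caching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Calling syscall_ns(1000000) on the machine of interest and comparing the
figure with the per-request budget of an NFS server would give a first
estimate; the value depends heavily on the CPU and kernel configuration.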
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:08 ` Bryan Henderson
@ 2005-09-22 21:48 ` Nicholas Miell
2005-09-22 22:50 ` Greg Banks
2 siblings, 0 replies; 41+ messages in thread
From: Nicholas Miell @ 2005-09-22 21:48 UTC (permalink / raw)
To: Peter Staubach
Cc: Trond Myklebust, Bryan Henderson, Neil Brown, akpm, andros,
bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch
On Thu, 2005-09-22 at 14:07 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>
> >On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
> >
> >
> >>It seems to me that a "system call" could be implemented which would allow
> >>a file to be "opened" via the file handle.
> >>
> >>
> >
> >Sure, but open alone isn't sufficient. A lot (most?) of the operations
> >involving filehandles are acting on directories.
> >
> >Imagine if someone renames a directory on the server while the NFS
> >server is in the middle of an unlink() operation, for instance.
> >
>
> Yup, although you could resolve that by introducing a whole set of
> operations which work off of file descriptors, instead of pathnames.
> Then, inside of the kernel, to do the real operation, the file
> descriptor would get turned back into the inode, but without the
> pathname lookup portion. Things like funlink(fd, name), fmkdir(fd, name),
> frmdir(fd, name), etc. Other operating systems have implemented at
> least a subset of these sorts of calls and it gets ugly quickly.
Solaris 10 calls them fchownat(2), fstatat(2), futimesat(2), openat(2),
renameat(2), and unlinkat(2). They mostly exist to support their
extended attributes implementation (hence the "at" postfix, and not to
be confused with Linux's xattrs), but they work for general filesystem
usage.
Besides being an interface to extended attributes and maybe making a
userspace NFSd feasible, they probably also improve filename lookup
performance on sufficiently deep directory hierarchies (think of httpd
opening /var/www/vservers/www.blah.com/html/ and then resolving
everything for that vserver relative to the cached fd).
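
The cached-fd pattern could look roughly like this. It is a sketch under
assumptions: it uses fstatat(2) as later standardized in POSIX.1-2008, and
the names open_root() and is_dir_under() are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>

/* Open the document root once; the descriptor is then kept and reused. */
int open_root(const char *dir)
{
    return open(dir, O_RDONLY);
}

/* Resolve `rel` relative to the cached root descriptor instead of walking
 * the full path from "/" again. Returns 1 for a directory, 0 for any other
 * file type, -1 if the lookup fails. */
int is_dir_under(int rootfd, const char *rel)
{
    struct stat st;

    if (fstatat(rootfd, rel, &st, 0) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

Each request then pays only for the components below the cached
directory, and keeps working even if a parent of that directory is renamed.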
--
Nicholas Miell <nmiell@comcast.net>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:08 ` Bryan Henderson
2005-09-22 21:48 ` NFS4 crack Nicholas Miell
@ 2005-09-22 22:50 ` Greg Banks
2 siblings, 0 replies; 41+ messages in thread
From: Greg Banks @ 2005-09-22 22:50 UTC (permalink / raw)
To: Peter Staubach
Cc: Trond Myklebust, Bryan Henderson, Neil Brown, akpm, andros,
bfields, Christoph Hellwig, linux-fsdevel, Olaf Kirch
On Thu, Sep 22, 2005 at 02:07:36PM -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>
> >On Thu, 22.09.2005 at 13:38 (-0400), Peter Staubach wrote:
> >
> >Sure, but open alone isn't sufficient. A lot (most?) of the operations
> >involving filehandles are acting on directories.
> >
> >Imagine if someone renames a directory on the server while the NFS
> >server is in the middle of an unlink() operation, for instance.
>
> Yup, although you could resolve that by introducing a whole set of
> operations which work off of file descriptors, instead of pathnames.
To see why this is a bad idea, google for the unforeseen security
implications of Solaris' fchroot() syscall. Adding this kind of
syscall is *not* cost-free, you just won't know the cost until it's
too late to fix.
> [...] there are performance issues as well.
Performance sells boxes, selling boxes pays my bills, that's enough
reason for me. The ability to do zero-copy efficiently and to
(eventually) support RDMA into the page cache is enough reason for
a kernel nfsd. Sendfile? don't make me laugh.
Also, a kernel nfsd can see network packet boundaries and other
information not visible through any existing network API, and it does
so in nonblocking fashion, which enables it to bounds check RPC calls
better than any userspace RPC implementation can. This is one reason
why (e.g.) TCP XDR fragment header DoS attacks are much harder against
a kernel based server than a userspace server. Another reason is
that the kernel nfsd refuses to accept multiple-fragment RPC calls,
which is impossible if you use the libc RPC server library.
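
For reference, RPC over TCP uses record marking: each fragment is preceded
by a 4-byte big-endian header whose top bit flags the last fragment and
whose low 31 bits give the fragment length. A sketch of the kind of check
described above follows; check_record_mark() and the max_len policy limit
are assumptions made for the illustration, not the kernel's actual code.

```c
#include <stddef.h>
#include <stdint.h>

#define RM_LAST_FRAG 0x80000000u  /* top bit: this is the final fragment */
#define RM_LEN_MASK  0x7fffffffu  /* low 31 bits: fragment length in bytes */

/* Decode a 4-byte record mark. Accepts only single-fragment records whose
 * length is nonzero and at most max_len. Returns 0 and stores the length in
 * *len on acceptance, -1 on rejection (*len is left untouched). */
int check_record_mark(const unsigned char hdr[4], uint32_t max_len,
                      uint32_t *len)
{
    uint32_t mark = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
                    ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    uint32_t n = mark & RM_LEN_MASK;

    if (!(mark & RM_LAST_FRAG))
        return -1;              /* multiple-fragment call: refuse */
    if (n == 0 || n > max_len)
        return -1;              /* empty or oversized record: refuse */

    *len = n;
    return 0;
}
```

Rejecting the record from the header alone means the server never has to
buffer data for a call it is going to refuse.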
Userspace nfsd: just say no.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-22 21:08 ` Bryan Henderson
@ 2005-09-23 12:17 ` Peter Staubach
2005-09-23 20:50 ` Bryan Henderson
0 siblings, 1 reply; 41+ messages in thread
From: Peter Staubach @ 2005-09-23 12:17 UTC (permalink / raw)
To: Bryan Henderson
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Trond Myklebust
Bryan Henderson wrote:
>
>It would be good to understand whether the NFS server is in the kernel for
>basic structural reasons or just because we're too lazy to invent this new
>system call interface, because that sheds light on how a normally user
>space problem like storing persistent application data for NFSv4 should be
>approached. Do we need a new kernel paradigm that admits file and
>filename use within the kernel, or do we hold our nose and say, "what's
>one more hack on top of an existing one?"
>
The NFS server is in the kernel for basic structural reasons, but also for
performance reasons. I would be happy to hear and/or read a proposal on
how to get packets containing requests and/or responses in and out of the
kernel without copying them. Inside of the kernel, both can be handled
with no copies.
It isn't that we are too lazy, by the way. This issue gets looked into
every so often. The set of system calls can be determined pretty quickly
and implementing them, while tricky in spots, can be done. However, the
ugliness of the implementation soon starts to overwhelm the cleanliness
of the design.
--
I would even be happy with seeing a user mode local disk based file system
which performed as well as a kernel mode file system. That seems easier
to me because then there wouldn't be any of those sticky networking issues
to worry about. When we get this, then we can consider the value of moving
something like NFS too.
Thanx...
ps
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack
2005-09-23 12:17 ` Peter Staubach
@ 2005-09-23 20:50 ` Bryan Henderson
2005-09-23 21:02 ` NFS4 crack\ Al Viro
0 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-23 20:50 UTC (permalink / raw)
To: Peter Staubach
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Trond Myklebust
>The NFS server is in the kernel for basic structural reasons, but also
>for performance reasons.
The two are orthogonal. The faster kernel performance could be either
because of the basic structure of the system (to go that fast in user
space would be impossible or require ugly interfaces) or just convenience
(to go that fast in user space, someone would have to add some
interfaces).
>I would be happy to hear and/or read a proposal on
>how to get packets containing requests and/or responses in and out of the
>kernel without copying them. Inside of the kernel, both can be handled
>with no copies.
Proposals are beyond the scope of this conversation, since we're not
trying to design (or even argue for) user space nfsd but rather to
understand the dilemma of kernel code needing to do something (access
files by name) that we've always considered a non-kernel activity. But if
your point is that a decent proposal doesn't exist because zero-copy
network communication fundamentally has to be in the kernel, then what
about zero copy disk file access? (direct I/O, raw device, mmap). The
basic facility seems to be there. And if you really can't do network
communication as fast in user space as in the kernel, should we expect
other network applications with a high speed requirement to go in the
kernel too?
It seems to me that the VFS interface is a lot better reason for nfsd to
be special and be an in-kernel application. But so far, I haven't seen
any argument that whatever nfsd needs out of VFS couldn't cleanly be added
to a system call interface. You say people have actually looked into it
and found that it has to be ugly; I just don't yet see why myself.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack\
2005-09-23 20:50 ` Bryan Henderson
@ 2005-09-23 21:02 ` Al Viro
2005-09-26 16:29 ` Bryan Henderson
0 siblings, 1 reply; 41+ messages in thread
From: Al Viro @ 2005-09-23 21:02 UTC (permalink / raw)
To: Bryan Henderson
Cc: Peter Staubach, akpm, andros, bfields, Christoph Hellwig,
linux-fsdevel, Neil Brown, Olaf Kirch, Trond Myklebust
On Fri, Sep 23, 2005 at 01:50:26PM -0700, Bryan Henderson wrote:
> It seems to me that the VFS interface is a lot better reason for nfsd to
> be special and be an in-kernel application. But so far, I haven't seen
> any argument that whatever nfsd needs out of VFS couldn't cleanly be added
> to a system call interface. You say people have actually looked into it
> and found that it has to be ugly; I just don't yet see why myself.
For one thing, you do *not* keep locks on directories across the syscall
boundary. Is that enough for you?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack\
2005-09-23 21:02 ` NFS4 crack\ Al Viro
@ 2005-09-26 16:29 ` Bryan Henderson
2005-09-26 17:13 ` Peter Staubach
0 siblings, 1 reply; 41+ messages in thread
From: Bryan Henderson @ 2005-09-26 16:29 UTC (permalink / raw)
To: Al Viro
Cc: akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Peter Staubach, Trond Myklebust
>On Fri, Sep 23, 2005 at 01:50:26PM -0700, Bryan Henderson wrote:
>> It seems to me that the VFS interface is a lot better reason for nfsd to
>> be special and be an in-kernel application. But so far, I haven't seen
>> any argument that whatever nfsd needs out of VFS couldn't cleanly be added
>> to a system call interface. You say people have actually looked into it
>> and found that it has to be ugly; I just don't yet see why myself.
>
>For one thing, you do *not* keep locks on directories across the syscall
>boundary. Is that enough for you?
Well, I wouldn't want to. I can't think of anything that an NFS server
does to a directory that couldn't be done cleanly with a single system
call, much the way the POSIX system calls do.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: NFS4 crack\
2005-09-26 16:29 ` Bryan Henderson
@ 2005-09-26 17:13 ` Peter Staubach
0 siblings, 0 replies; 41+ messages in thread
From: Peter Staubach @ 2005-09-26 17:13 UTC (permalink / raw)
To: Bryan Henderson
Cc: Al Viro, akpm, andros, bfields, Christoph Hellwig, linux-fsdevel,
Neil Brown, Olaf Kirch, Trond Myklebust
Bryan Henderson wrote:
>
>Well, I wouldn't want to. I can't think of anything that an NFS server
>does to a directory that couldn't be done cleanly with a single system
>call, much the way the POSIX system calls do.
>
I might object to the characterization of "cleanly". We would need system
calls which matched the specific semantics of NFS operations. For example,
we would need system calls which understood pre-operation and post-operation
attributes. We would need system calls which understood 32 bit limits so
that we could correctly implement NFS version 2. NFS version 3 and NFS
version 2 are probably close enough that we could use a common set of
system calls with appropriate flags, but it does not seem likely that these
system calls would suffice for NFS version 4. We could end up with a whole
lot of system calls, even if they all went through a common entry point into
the kernel.
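
To make the pre-operation/post-operation point concrete: NFS version 3
replies to modifying operations carry attributes sampled before and after
the change. With today's POSIX calls a user-level server can only
approximate that, as in the sketch below. The names struct wcc_sketch and
truncate_with_wcc() are hypothetical, and the three calls are not atomic
with respect to other writers, which is exactly the gap a dedicated system
call would have to close.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Attributes captured around one modifying operation. */
struct wcc_sketch {
    struct stat before;   /* sampled just before the operation */
    struct stat after;    /* sampled just after the operation */
};

/* Set the size of `path` (created if missing) to new_size and record the
 * attributes on either side of the change. Returns 0 on success, -1 on
 * error. Another process may still slip in between the three calls. */
int truncate_with_wcc(const char *path, off_t new_size, struct wcc_sketch *w)
{
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (fstat(fd, &w->before) == 0 &&
        ftruncate(fd, new_size) == 0 &&
        fstat(fd, &w->after) == 0)
        rc = 0;
    close(fd);
    return rc;
}
```

Each NFS operation with such semantics would need its own variant, which
is how the count of system calls grows.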
Thanx...
ps
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2005-09-26 17:14 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-18 10:21 NFS4 crack Christoph Hellwig
2005-09-18 14:36 ` J. Bruce Fields
2005-09-19 10:35 ` Christoph Hellwig
2005-09-19 13:04 ` Anton Altaparmakov
2005-09-19 13:35 ` J. Bruce Fields
2005-09-19 13:39 ` Christoph Hellwig
2005-09-19 14:07 ` J. Bruce Fields
2005-09-19 14:11 ` Christoph Hellwig
2005-09-19 17:13 ` Bryan Henderson
2005-09-19 17:16 ` Randy.Dunlap
2005-09-19 21:57 ` Bryan Henderson
2005-09-19 22:11 ` Randy.Dunlap
2005-09-20 0:17 ` Bryan Henderson
2005-09-19 18:02 ` Christoph Hellwig
2005-09-19 18:53 ` William A.(Andy) Adamson
2005-09-19 18:59 ` Christoph Hellwig
2005-09-19 22:04 ` Bryan Henderson
2005-09-19 19:01 ` J. Bruce Fields
2005-09-19 19:05 ` Christoph Hellwig
2005-09-19 20:31 ` J. Bruce Fields
2005-09-20 12:49 ` Greg KH
2005-09-20 15:10 ` William A.(Andy) Adamson
2005-09-20 18:37 ` Neil Brown
2005-09-21 7:44 ` Andrew Morton
2005-09-22 20:58 ` William A.(Andy) Adamson
2005-09-21 13:41 ` Trond Myklebust
2005-09-21 14:40 ` J. Bruce Fields
2005-09-22 16:28 ` Bryan Henderson
2005-09-22 16:52 ` Trond Myklebust
2005-09-22 17:38 ` Peter Staubach
2005-09-22 17:52 ` Trond Myklebust
2005-09-22 18:07 ` Peter Staubach
2005-09-22 21:08 ` Bryan Henderson
2005-09-23 12:17 ` Peter Staubach
2005-09-23 20:50 ` Bryan Henderson
2005-09-23 21:02 ` NFS4 crack\ Al Viro
2005-09-26 16:29 ` Bryan Henderson
2005-09-26 17:13 ` Peter Staubach
2005-09-22 21:48 ` NFS4 crack Nicholas Miell
2005-09-22 22:50 ` Greg Banks
2005-09-22 21:19 ` Bryan Henderson