* [RFC] [PATCH 0/2] mkdir lookup optimization @ 2016-07-07 5:53 Oleg Drokin 2016-07-07 5:53 ` [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry Oleg Drokin ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 5:53 UTC (permalink / raw) To: Trond Myklebust, Jeff Layton, Al Viro Cc: linux-fsdevel, linux-nfs, Oleg Drokin (sorry for resend, the first go around did not make it to fsdevel and to Al). This is inspired by a bug in Lustre that's ATM is shared by NFS and used o be shared by CIFS code. The problem at hand is: when you try to mkdir in a directory where you do not have permissions to create anything, you only supposed to get EPERM if the directory you are creatign does not exist. Now if the name does exist, you are supposed to get EEXIST instead. There are tons of programs that when fed a pathname go and try to perform a create of every path component starting from /, and ignoring EEXIST, but not other errors. Those programs are broken by the above mentioned bug. All is fine everywhere by Lustre and NFS at the moment, because there's an optimization at hand. e.g. in NFS: /* * If we're doing an exclusive create, optimize away the lookup * but don't hash the dentry. */ if (nfs_is_exclusive_create(dir, flags)) return NULL; Now, this is all fine except when you have no permissions to create anything - then vfs_mknod/mkdir/create will do may_create(dir, dentry) and we exit spuriously with EPERM. [green@fedora1 crash]$ mkdir aaa mkdir: cannot create directory 'aaa': Permission denied [green@fedora1 crash]$ mkdir lost+found mkdir: cannot create directory 'lost+found': Permission denied [green@fedora1 crash]$ ls -ld lost+found drwx------ 2 root root 16384 May 25 2013 lost+found [green@fedora1 crash]$ mkdir lost+found mkdir: cannot create directory 'lost+found': File exists cifs had exactly the same code, but it got removed when atomic_open was introduced (throwing away a perfectly good optimization for mkdir in process) with commit d2c127197dfc0b2bae62a52e1e0d3e3ff493919e "cifs: implement i_op->atomic_open()" These two patches are the lazy way of fixing the problem - "just throw in the extra permission check before bailing out" with a bit of complication on the NFS side because there the inode permission check is actually circumvented in nfs_permission, for MAY_WRITE | !MAY_READ case which is enough to fool may_create, but not enough to fool some following check, I guess as the problem still exists. (I am not sure of the performance implications of just removing that thing in nfs_permission). Anyway I think instead of resurrecting this optimization for cifs, and seeing if ceph and others need it, why not bring it up all the way to __lookup_hash() so that we don't do actual lookup if the parent is writeable? Even for local filesystems like ext4 that's of benefit - we save one lookup (even with hashed dirs, that only gives us the last blook to lookat and then we still need to check all names to make sure the one we want does not exist - so it's not exactly free). This should not upset any sort of client-side SELinux/other security stuff magic either. If the name exists, we get EEXIST no matter what, if it does not exist, parent policy declares if we can create or not anyway. Something like this (+ whatever nfs_permission fix): diff --git a/fs/namei.c b/fs/namei.c index 70580ab..b9de645 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1512,6 +1512,10 @@ static struct dentry *__lookup_hash(const struct qstr *name, if (unlikely(!dentry)) return ERR_PTR(-ENOMEM); + if ((flags & LOOKUP_EXCL|LOOKUP_CREATE) && + (may_create(base, dentry) == 0)) + return dentry; + return lookup_real(base->d_inode, dentry, flags); } Comments? Oleg Drokin (2): nfs: Fix spurios EPERM when mkdir of existing dentry staging/lustre: Prevent spurious EPERM on mkdir drivers/staging/lustre/lustre/llite/namei.c | 8 ++++++-- fs/nfs/dir.c | 4 +++- 2 files changed, 9 insertions(+), 3 deletions(-) -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 5:53 [RFC] [PATCH 0/2] mkdir lookup optimization Oleg Drokin @ 2016-07-07 5:53 ` Oleg Drokin 2016-07-07 16:16 ` Trond Myklebust 2016-07-07 5:53 ` [PATCH 2/2] staging/lustre: Prevent spurious EPERM on mkdir Oleg Drokin 2016-07-07 10:53 ` [RFC] [PATCH 0/2] mkdir lookup optimization Jeff Layton 2 siblings, 1 reply; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 5:53 UTC (permalink / raw) To: Trond Myklebust, Jeff Layton, Al Viro Cc: linux-fsdevel, linux-nfs, Oleg Drokin It's great when we can shave an extra RPC, but not at the expense of correctness. We should not return EPERM (from vfs_create/mknod/mkdir) if the name already exists, even if we have no write access in parent. Since the check in nfs_permission is clearly not enough to stave off this, just throw in the extra READ access to actually go through. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> --- fs/nfs/dir.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index d8015a0..8c7835b 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in /* * If we're doing an exclusive create, optimize away the lookup * but don't hash the dentry. + * This optimization only works if we can write in the parent. */ - if (nfs_is_exclusive_create(dir, flags)) + if (nfs_is_exclusive_create(dir, flags) && + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) return NULL; res = ERR_PTR(-ENOMEM); -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 5:53 ` [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry Oleg Drokin @ 2016-07-07 16:16 ` Trond Myklebust 2016-07-07 16:52 ` Oleg Drokin 0 siblings, 1 reply; 12+ messages in thread From: Trond Myklebust @ 2016-07-07 16:16 UTC (permalink / raw) To: Oleg Drokin Cc: Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org > On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: > > It's great when we can shave an extra RPC, but not at the expense > of correctness. > We should not return EPERM (from vfs_create/mknod/mkdir) if the > name already exists, even if we have no write access in parent. > > Since the check in nfs_permission is clearly not enough to stave > off this, just throw in the extra READ access to actually > go through. > > Signed-off-by: Oleg Drokin <green@linuxhacker.ru> > --- > fs/nfs/dir.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index d8015a0..8c7835b 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in > /* > * If we're doing an exclusive create, optimize away the lookup > * but don't hash the dentry. > + * This optimization only works if we can write in the parent. > */ > - if (nfs_is_exclusive_create(dir, flags)) > + if (nfs_is_exclusive_create(dir, flags) && > + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) > return NULL; > NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. Cheers Trond ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 16:16 ` Trond Myklebust @ 2016-07-07 16:52 ` Oleg Drokin 2016-07-07 16:59 ` Trond Myklebust 0 siblings, 1 reply; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 16:52 UTC (permalink / raw) To: Trond Myklebust Cc: Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: > >> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >> >> It's great when we can shave an extra RPC, but not at the expense >> of correctness. >> We should not return EPERM (from vfs_create/mknod/mkdir) if the >> name already exists, even if we have no write access in parent. >> >> Since the check in nfs_permission is clearly not enough to stave >> off this, just throw in the extra READ access to actually >> go through. >> >> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >> --- >> fs/nfs/dir.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >> index d8015a0..8c7835b 100644 >> --- a/fs/nfs/dir.c >> +++ b/fs/nfs/dir.c >> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >> /* >> * If we're doing an exclusive create, optimize away the lookup >> * but don't hash the dentry. >> + * This optimization only works if we can write in the parent. >> */ >> - if (nfs_is_exclusive_create(dir, flags)) >> + if (nfs_is_exclusive_create(dir, flags) && >> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >> return NULL; >> > > NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. Right. This was mostly a discussion piece. The problem here is nfs_permission() returns 0 if you check for inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then some other checks in the kernel still catch on to the fact that the directory is not writeable, so we have a premature failure with EPERM and server never sees this request which breaks things. (the read-only mount is not handled as well at the moment of course and my patch does not address this issue either, but it's easier to address in the VFS, like in filename_create() or something). I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check are creates and deletes and with deletes we had a lookup already, so it already looked up the child and revalidated the parent. For creates, a revalidation still might be needed, I guess and that was the main driver behind this check? And that only when you do current dir creates, because otherwise the parent would have been revalidated in lookup? Is this the major case why that check is actually there? Just trying to see how to approach this better without breaking the applications. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 16:52 ` Oleg Drokin @ 2016-07-07 16:59 ` Trond Myklebust 2016-07-07 17:07 ` Oleg Drokin 0 siblings, 1 reply; 12+ messages in thread From: Trond Myklebust @ 2016-07-07 16:59 UTC (permalink / raw) To: Oleg Drokin Cc: Trond Myklebust, Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org > On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: > > > On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: > >> >>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>> >>> It's great when we can shave an extra RPC, but not at the expense >>> of correctness. >>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>> name already exists, even if we have no write access in parent. >>> >>> Since the check in nfs_permission is clearly not enough to stave >>> off this, just throw in the extra READ access to actually >>> go through. >>> >>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>> --- >>> fs/nfs/dir.c | 4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>> index d8015a0..8c7835b 100644 >>> --- a/fs/nfs/dir.c >>> +++ b/fs/nfs/dir.c >>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>> /* >>> * If we're doing an exclusive create, optimize away the lookup >>> * but don't hash the dentry. >>> + * This optimization only works if we can write in the parent. >>> */ >>> - if (nfs_is_exclusive_create(dir, flags)) >>> + if (nfs_is_exclusive_create(dir, flags) && >>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>> return NULL; >>> >> >> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. > > Right. This was mostly a discussion piece. > The problem here is nfs_permission() returns 0 if you check for > inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then > some other checks in the kernel still catch on to the fact that the directory > is not writeable, so we have a premature failure with EPERM and server never sees > this request which breaks things. Are these VFS level checks? Which ones? > > (the read-only mount is not handled as well at the moment of course and my patch > does not address this issue either, but it's easier to address in the VFS, like > in filename_create() or something). > > I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check > are creates and deletes and with deletes we had a lookup already, so it already > looked up the child and revalidated the parent. > For creates, a revalidation still might be needed, I guess and that was the main driver > behind this check? And that only when you do current dir creates, because otherwise > the parent would have been revalidated in lookup? > Is this the major case why that check is actually there? > > Just trying to see how to approach this better without breaking the applications. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 16:59 ` Trond Myklebust @ 2016-07-07 17:07 ` Oleg Drokin 2016-07-07 17:27 ` Trond Myklebust 0 siblings, 1 reply; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 17:07 UTC (permalink / raw) To: Trond Myklebust Cc: Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote: > >> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: >> >> >> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: >> >>> >>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>>> >>>> It's great when we can shave an extra RPC, but not at the expense >>>> of correctness. >>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>>> name already exists, even if we have no write access in parent. >>>> >>>> Since the check in nfs_permission is clearly not enough to stave >>>> off this, just throw in the extra READ access to actually >>>> go through. >>>> >>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>>> --- >>>> fs/nfs/dir.c | 4 +++- >>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>>> index d8015a0..8c7835b 100644 >>>> --- a/fs/nfs/dir.c >>>> +++ b/fs/nfs/dir.c >>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>>> /* >>>> * If we're doing an exclusive create, optimize away the lookup >>>> * but don't hash the dentry. >>>> + * This optimization only works if we can write in the parent. >>>> */ >>>> - if (nfs_is_exclusive_create(dir, flags)) >>>> + if (nfs_is_exclusive_create(dir, flags) && >>>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>>> return NULL; >>>> >>> >>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. >> >> Right. This was mostly a discussion piece. >> The problem here is nfs_permission() returns 0 if you check for >> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then >> some other checks in the kernel still catch on to the fact that the directory >> is not writeable, so we have a premature failure with EPERM and server never sees >> this request which breaks things. > > Are these VFS level checks? Which ones? Yes, VFS level I believe. For Lustre it's may_create() from vfs_mkdir() that stops us short and the Lustre patch is effective. but for NFS this must be something else and I did not trace it completely. One of the security checks, I guess? if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC) as in Lustre, that returns 0 no matter what. This is trivial to reproduce too. >> (the read-only mount is not handled as well at the moment of course and my patch >> does not address this issue either, but it's easier to address in the VFS, like >> in filename_create() or something). >> >> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check >> are creates and deletes and with deletes we had a lookup already, so it already >> looked up the child and revalidated the parent. >> For creates, a revalidation still might be needed, I guess and that was the main driver >> behind this check? And that only when you do current dir creates, because otherwise >> the parent would have been revalidated in lookup? >> Is this the major case why that check is actually there? >> >> Just trying to see how to approach this better without breaking the applications. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 17:07 ` Oleg Drokin @ 2016-07-07 17:27 ` Trond Myklebust 2016-07-07 17:32 ` Oleg Drokin 2016-07-07 21:52 ` Oleg Drokin 0 siblings, 2 replies; 12+ messages in thread From: Trond Myklebust @ 2016-07-07 17:27 UTC (permalink / raw) To: Oleg Drokin Cc: Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org > On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote: > > > On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote: > >> >>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: >>> >>> >>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: >>> >>>> >>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>>>> >>>>> It's great when we can shave an extra RPC, but not at the expense >>>>> of correctness. >>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>>>> name already exists, even if we have no write access in parent. >>>>> >>>>> Since the check in nfs_permission is clearly not enough to stave >>>>> off this, just throw in the extra READ access to actually >>>>> go through. >>>>> >>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>>>> --- >>>>> fs/nfs/dir.c | 4 +++- >>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>>>> index d8015a0..8c7835b 100644 >>>>> --- a/fs/nfs/dir.c >>>>> +++ b/fs/nfs/dir.c >>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>>>> /* >>>>> * If we're doing an exclusive create, optimize away the lookup >>>>> * but don't hash the dentry. >>>>> + * This optimization only works if we can write in the parent. >>>>> */ >>>>> - if (nfs_is_exclusive_create(dir, flags)) >>>>> + if (nfs_is_exclusive_create(dir, flags) && >>>>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>>>> return NULL; >>>>> >>>> >>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. >>> >>> Right. This was mostly a discussion piece. >>> The problem here is nfs_permission() returns 0 if you check for >>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then >>> some other checks in the kernel still catch on to the fact that the directory >>> is not writeable, so we have a premature failure with EPERM and server never sees >>> this request which breaks things. >> >> Are these VFS level checks? Which ones? > > Yes, VFS level I believe. > For Lustre it's may_create() from vfs_mkdir() that stops us short > and the Lustre patch is effective. > but for NFS this must be something else and I did not trace > it completely. One of the security checks, I guess? > if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC) > as in Lustre, that returns 0 no matter what. > > This is trivial to reproduce too. So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here. > >>> (the read-only mount is not handled as well at the moment of course and my patch >>> does not address this issue either, but it's easier to address in the VFS, like >>> in filename_create() or something). >>> >>> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check >>> are creates and deletes and with deletes we had a lookup already, so it already >>> looked up the child and revalidated the parent. >>> For creates, a revalidation still might be needed, I guess and that was the main driver >>> behind this check? And that only when you do current dir creates, because otherwise >>> the parent would have been revalidated in lookup? >>> Is this the major case why that check is actually there? >>> >>> Just trying to see how to approach this better without breaking the applications. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 17:27 ` Trond Myklebust @ 2016-07-07 17:32 ` Oleg Drokin 2016-07-07 21:52 ` Oleg Drokin 1 sibling, 0 replies; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 17:32 UTC (permalink / raw) To: Trond Myklebust Cc: Jeff Layton, Viro Alexander, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote: > >> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote: >> >> >> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote: >> >>> >>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: >>>> >>>> >>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: >>>> >>>>> >>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>>>>> >>>>>> It's great when we can shave an extra RPC, but not at the expense >>>>>> of correctness. >>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>>>>> name already exists, even if we have no write access in parent. >>>>>> >>>>>> Since the check in nfs_permission is clearly not enough to stave >>>>>> off this, just throw in the extra READ access to actually >>>>>> go through. >>>>>> >>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>>>>> --- >>>>>> fs/nfs/dir.c | 4 +++- >>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>>>>> index d8015a0..8c7835b 100644 >>>>>> --- a/fs/nfs/dir.c >>>>>> +++ b/fs/nfs/dir.c >>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>>>>> /* >>>>>> * If we're doing an exclusive create, optimize away the lookup >>>>>> * but don't hash the dentry. >>>>>> + * This optimization only works if we can write in the parent. >>>>>> */ >>>>>> - if (nfs_is_exclusive_create(dir, flags)) >>>>>> + if (nfs_is_exclusive_create(dir, flags) && >>>>>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>>>>> return NULL; >>>>>> >>>>> >>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. >>>> >>>> Right. This was mostly a discussion piece. >>>> The problem here is nfs_permission() returns 0 if you check for >>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then >>>> some other checks in the kernel still catch on to the fact that the directory >>>> is not writeable, so we have a premature failure with EPERM and server never sees >>>> this request which breaks things. >>> >>> Are these VFS level checks? Which ones? >> >> Yes, VFS level I believe. >> For Lustre it's may_create() from vfs_mkdir() that stops us short >> and the Lustre patch is effective. >> but for NFS this must be something else and I did not trace >> it completely. One of the security checks, I guess? >> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC) >> as in Lustre, that returns 0 no matter what. >> >> This is trivial to reproduce too. > > So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here. I am not sure what do you mean here. When vfs asks for inode_permission(dir, MAY_WRITE|MAY_EXEC) NFS already returns 0 (i.e. everything is ok, go ahead). But it's all futile because some other check somewhere still catches on to the fact that the directory is not writeable and does not let the mkdir to actually happen in the FS itself. >>>> (the read-only mount is not handled as well at the moment of course and my patch >>>> does not address this issue either, but it's easier to address in the VFS, like >>>> in filename_create() or something). >>>> >>>> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check >>>> are creates and deletes and with deletes we had a lookup already, so it already >>>> looked up the child and revalidated the parent. >>>> For creates, a revalidation still might be needed, I guess and that was the main driver >>>> behind this check? And that only when you do current dir creates, because otherwise >>>> the parent would have been revalidated in lookup? >>>> Is this the major case why that check is actually there? >>>> >>>> Just trying to see how to approach this better without breaking the applications. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 17:27 ` Trond Myklebust 2016-07-07 17:32 ` Oleg Drokin @ 2016-07-07 21:52 ` Oleg Drokin 2016-07-07 23:17 ` Trond Myklebust 1 sibling, 1 reply; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 21:52 UTC (permalink / raw) To: Trond Myklebust, J . Bruce Fields Cc: Jeff Layton, Viro Alexander, linux-fsdevel, linux-nfs On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote: > >> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote: >> >> >> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote: >> >>> >>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: >>>> >>>> >>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: >>>> >>>>> >>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>>>>> >>>>>> It's great when we can shave an extra RPC, but not at the expense >>>>>> of correctness. >>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>>>>> name already exists, even if we have no write access in parent. >>>>>> >>>>>> Since the check in nfs_permission is clearly not enough to stave >>>>>> off this, just throw in the extra READ access to actually >>>>>> go through. >>>>>> >>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>>>>> --- >>>>>> fs/nfs/dir.c | 4 +++- >>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>>>>> index d8015a0..8c7835b 100644 >>>>>> --- a/fs/nfs/dir.c >>>>>> +++ b/fs/nfs/dir.c >>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>>>>> /* >>>>>> * If we're doing an exclusive create, optimize away the lookup >>>>>> * but don't hash the dentry. >>>>>> + * This optimization only works if we can write in the parent. >>>>>> */ >>>>>> - if (nfs_is_exclusive_create(dir, flags)) >>>>>> + if (nfs_is_exclusive_create(dir, flags) && >>>>>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>>>>> return NULL; >>>>>> >>>>> >>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. >>>> >>>> Right. This was mostly a discussion piece. >>>> The problem here is nfs_permission() returns 0 if you check for >>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then >>>> some other checks in the kernel still catch on to the fact that the directory >>>> is not writeable, so we have a premature failure with EPERM and server never sees >>>> this request which breaks things. >>> >>> Are these VFS level checks? Which ones? >> >> Yes, VFS level I believe. >> For Lustre it's may_create() from vfs_mkdir() that stops us short >> and the Lustre patch is effective. >> but for NFS this must be something else and I did not trace >> it completely. One of the security checks, I guess? >> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC) >> as in Lustre, that returns 0 no matter what. >> >> This is trivial to reproduce too. > > So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here. You know, I have been wrong here. VFS is not in the way. it's the NFS server that's returning us the error: [ 326.069066] NFS: mkdir(0:44/12), test [ 326.070510] nfsd_dispatch: vers 3 proc 9 [ 326.071612] nfsd: MKDIR(3) 16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000 test [ 326.071617] nfsd: fh_verify(16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000) [ 326.071684] fh_verify: /racer permission failure, acc=3, error=13 So I guess nfsd wants to be lazy too? What side do you think this is best fixed then? the client or the server? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry 2016-07-07 21:52 ` Oleg Drokin @ 2016-07-07 23:17 ` Trond Myklebust 0 siblings, 0 replies; 12+ messages in thread From: Trond Myklebust @ 2016-07-07 23:17 UTC (permalink / raw) To: Oleg Drokin Cc: Fields Bruce, Jeff Layton, Viro Alexander, <linux-fsdevel@vger.kernel.org>, Linux NFS Mailing List > On Jul 7, 2016, at 17:52, Oleg Drokin <green@linuxhacker.ru> wrote: > > > On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote: > >> >>> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote: >>> >>> >>> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote: >>> >>>> >>>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote: >>>>> >>>>> >>>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote: >>>>> >>>>>> >>>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote: >>>>>>> >>>>>>> It's great when we can shave an extra RPC, but not at the expense >>>>>>> of correctness. >>>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the >>>>>>> name already exists, even if we have no write access in parent. >>>>>>> >>>>>>> Since the check in nfs_permission is clearly not enough to stave >>>>>>> off this, just throw in the extra READ access to actually >>>>>>> go through. >>>>>>> >>>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> >>>>>>> --- >>>>>>> fs/nfs/dir.c | 4 +++- >>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c >>>>>>> index d8015a0..8c7835b 100644 >>>>>>> --- a/fs/nfs/dir.c >>>>>>> +++ b/fs/nfs/dir.c >>>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in >>>>>>> /* >>>>>>> * If we're doing an exclusive create, optimize away the lookup >>>>>>> * but don't hash the dentry. >>>>>>> + * This optimization only works if we can write in the parent. >>>>>>> */ >>>>>>> - if (nfs_is_exclusive_create(dir, flags)) >>>>>>> + if (nfs_is_exclusive_create(dir, flags) && >>>>>>> + (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0)) >>>>>>> return NULL; >>>>>>> >>>>>> >>>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server. >>>>> >>>>> Right. This was mostly a discussion piece. >>>>> The problem here is nfs_permission() returns 0 if you check for >>>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then >>>>> some other checks in the kernel still catch on to the fact that the directory >>>>> is not writeable, so we have a premature failure with EPERM and server never sees >>>>> this request which breaks things. >>>> >>>> Are these VFS level checks? Which ones? >>> >>> Yes, VFS level I believe. >>> For Lustre it's may_create() from vfs_mkdir() that stops us short >>> and the Lustre patch is effective. >>> but for NFS this must be something else and I did not trace >>> it completely. One of the security checks, I guess? >>> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC) >>> as in Lustre, that returns 0 no matter what. >>> >>> This is trivial to reproduce too. >> >> So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here. > > You know, I have been wrong here. > VFS is not in the way. > it's the NFS server that's returning us the error: > [ 326.069066] NFS: mkdir(0:44/12), test > [ 326.070510] nfsd_dispatch: vers 3 proc 9 > [ 326.071612] nfsd: MKDIR(3) 16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000 test > [ 326.071617] nfsd: fh_verify(16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000) > [ 326.071684] fh_verify: /racer permission failure, acc=3, error=13 > > So I guess nfsd wants to be lazy too? > What side do you think this is best fixed then? the client or the server? If the bug lies in the knfsd server, then we should fix it there. The Linux client isn�t the only one out there, and those other clients should also be able to depend on correct semantics from the server. Cheers Trond ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] staging/lustre: Prevent spurious EPERM on mkdir 2016-07-07 5:53 [RFC] [PATCH 0/2] mkdir lookup optimization Oleg Drokin 2016-07-07 5:53 ` [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry Oleg Drokin @ 2016-07-07 5:53 ` Oleg Drokin 2016-07-07 10:53 ` [RFC] [PATCH 0/2] mkdir lookup optimization Jeff Layton 2 siblings, 0 replies; 12+ messages in thread From: Oleg Drokin @ 2016-07-07 5:53 UTC (permalink / raw) To: Trond Myklebust, Jeff Layton, Al Viro Cc: linux-fsdevel, linux-nfs, Oleg Drokin if the name already exists, but we don't have write permissions in the parent Signed-off-by: Oleg Drokin <green@linuxhacker.ru> --- drivers/staging/lustre/lustre/llite/namei.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c index 9321bd8..4c3dde1 100644 --- a/drivers/staging/lustre/lustre/llite/namei.c +++ b/drivers/staging/lustre/lustre/llite/namei.c @@ -544,8 +544,12 @@ static struct dentry *ll_lookup_nd(struct inode *parent, struct dentry *dentry, CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),flags=%u\n", dentry, PFID(ll_inode2fid(parent)), parent, flags); - /* Optimize away (CREATE && !OPEN). Let .create handle the race. */ - if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN)) + /* Optimize away (CREATE && !OPEN). Let .create handle the race. + * but only if we have write permissions there, otherwise we need + * to proceed with lookup. LU-4185 + */ + if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN) && + (inode_permission(parent, MAY_WRITE | MAY_EXEC) == 0)) return NULL; if (flags & (LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE)) -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC] [PATCH 0/2] mkdir lookup optimization 2016-07-07 5:53 [RFC] [PATCH 0/2] mkdir lookup optimization Oleg Drokin 2016-07-07 5:53 ` [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry Oleg Drokin 2016-07-07 5:53 ` [PATCH 2/2] staging/lustre: Prevent spurious EPERM on mkdir Oleg Drokin @ 2016-07-07 10:53 ` Jeff Layton 2 siblings, 0 replies; 12+ messages in thread From: Jeff Layton @ 2016-07-07 10:53 UTC (permalink / raw) To: Oleg Drokin, Trond Myklebust, Al Viro; +Cc: linux-fsdevel, linux-nfs On Thu, 2016-07-07 at 01:53 -0400, Oleg Drokin wrote: > (sorry for resend, the first go around did not make it to fsdevel and to Al). > > This is inspired by a bug in Lustre that's ATM is shared by NFS > and used o be shared by CIFS code. > > The problem at hand is: when you try to mkdir in a directory > where you do not have permissions to create anything, you only > supposed to get EPERM if the directory you are creatign does not exist. > Now if the name does exist, you are supposed to get EEXIST instead. > There are tons of programs that when fed a pathname go and try > to perform a create of every path component starting from /, > and ignoring EEXIST, but not other errors. Those programs are broken > by the above mentioned bug. > > All is fine everywhere by Lustre and NFS at the moment, because > there's an optimization at hand. e.g. in NFS: > /* > * If we're doing an exclusive create, optimize away the lookup > * but don't hash the dentry. > */ > if (nfs_is_exclusive_create(dir, flags)) > return NULL; > > Now, this is all fine except when you have no permissions to create > anything - then vfs_mknod/mkdir/create will do may_create(dir, dentry) > and we exit spuriously with EPERM. > > [green@fedora1 crash]$ mkdir aaa > mkdir: cannot create directory 'aaa': Permission denied > [green@fedora1 crash]$ mkdir lost+found > mkdir: cannot create directory 'lost+found': Permission denied > [green@fedora1 crash]$ ls -ld lost+found > drwx------ 2 root root 16384 May 25 2013 lost+found > [green@fedora1 crash]$ mkdir lost+found > mkdir: cannot create directory 'lost+found': File exists > > cifs had exactly the same code, but it got removed when atomic_open > was introduced (throwing away a perfectly good optimization for mkdir > in process) with commit d2c127197dfc0b2bae62a52e1e0d3e3ff493919e > "cifs: implement i_op->atomic_open()" > > These two patches are the lazy way of fixing the problem - > "just throw in the extra permission check before bailing out" > with a bit of complication on the NFS side because there > the inode permission check is actually circumvented in nfs_permission, > for MAY_WRITE | !MAY_READ case which is enough to fool > may_create, but not enough to fool some following check, I guess > as the problem still exists. > (I am not sure of the performance implications of just removing that > thing in nfs_permission). > > Anyway I think instead of resurrecting this optimization for cifs, > and seeing if ceph and others need it, why not bring it up > all the way to __lookup_hash() so that we don't do actual lookup > if the parent is writeable? > > Even for local filesystems like ext4 that's of benefit - we save > one lookup (even with hashed dirs, that only gives us the last blook > to lookat and then we still need to check all names to make sure > the one we want does not exist - so it's not exactly free). > > This should not upset any sort of client-side SELinux/other security > stuff magic either. If the name exists, we get EEXIST no matter what, > if it does not exist, parent policy declares if we can create or not > anyway. > > Something like this (+ whatever nfs_permission fix): > diff --git a/fs/namei.c b/fs/namei.c > index 70580ab..b9de645 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1512,6 +1512,10 @@ static struct dentry *__lookup_hash(const struct qstr *name, > if (unlikely(!dentry)) > return ERR_PTR(-ENOMEM); > > + if ((flags & LOOKUP_EXCL|LOOKUP_CREATE) && > + (may_create(base, dentry) == 0)) > + return dentry; > + That would need to check that LOOKUP_EXCL is actually set. I think you want something like: (flags & (LOOKUP_EXCL|LOOKUP_CREATE)) == (LOOKUP_EXCL|LOOKUP_CREATE) ...and you'd have to figure out how to determine the isdir param for may_create at that point. That said, it does seem like a reasonable idea at first glance. > return lookup_real(base->d_inode, dentry, flags); > } > > Comments? > > Oleg Drokin (2): > nfs: Fix spurios EPERM when mkdir of existing dentry > staging/lustre: Prevent spurious EPERM on mkdir > > drivers/staging/lustre/lustre/llite/namei.c | 8 ++++++-- > fs/nfs/dir.c | 4 +++- > 2 files changed, 9 insertions(+), 3 deletions(-) > -- Jeff Layton <jlayton@poochiereds.net> ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2016-07-07 23:17 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-07-07 5:53 [RFC] [PATCH 0/2] mkdir lookup optimization Oleg Drokin 2016-07-07 5:53 ` [PATCH 1/2] nfs: Fix spurios EPERM when mkdir of existing dentry Oleg Drokin 2016-07-07 16:16 ` Trond Myklebust 2016-07-07 16:52 ` Oleg Drokin 2016-07-07 16:59 ` Trond Myklebust 2016-07-07 17:07 ` Oleg Drokin 2016-07-07 17:27 ` Trond Myklebust 2016-07-07 17:32 ` Oleg Drokin 2016-07-07 21:52 ` Oleg Drokin 2016-07-07 23:17 ` Trond Myklebust 2016-07-07 5:53 ` [PATCH 2/2] staging/lustre: Prevent spurious EPERM on mkdir Oleg Drokin 2016-07-07 10:53 ` [RFC] [PATCH 0/2] mkdir lookup optimization Jeff Layton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).