* [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
@ 2019-09-03 11:58 Ritesh Harjani
2019-09-03 12:59 ` Gao Xiang
0 siblings, 1 reply; 6+ messages in thread
From: Ritesh Harjani @ 2019-09-03 11:58 UTC (permalink / raw)
To: linux-fsdevel, Alexander Viro; +Cc: aneesh.kumar, Jeff Layton, wugyuan
Hi Viro/All,
Could you please review below issue and it's proposed solutions.
If you could let me know which of the two you think will be a better
approach to solve this or in case if you have any other better approach,
I can prepare and submit a official patch with that.
Issue signature:-
[NIP : trailing_symlink+80]
[LR : trailing_symlink+1092]
#4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
#5 [c00000198069bc00] path_openat at c0000000004bdd14
#6 [c00000198069bc90] do_filp_open at c0000000004c0274
#7 [c00000198069bdb0] do_sys_open at c00000000049b248
#8 [c00000198069be30] system_call at c00000000000b388
Test case:-
shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
/gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
Problem description:-
In some filesystems like GPFS below described scenario may happen on
some platforms (Reported-By:- wugyuan)
Here, two threads are being run in 2 different shells. Thread-1(cat)
does cat of the symlink and Thread-2(ln) is creating the symlink.
Now on any platform with GPFS like filesystem, if CPU does out-of-order
execution (or any kind of re-ordering due compiler optimization?) in
function __d_set_and_inode_type(), then we see a NULL pointer
dereference due to inode->i_uid.
This happens because in lookup_fast in nonRCU path or say REF-walk (i.e.
in else condition), we check d_is_negative() without any lock protection.
And since in __d_set_and_inode_type() re-ordering may happen in setting
of dentry->type & dentry->inode => this means that there is this tiny
window where things are going wrong.
(GPFS like):- Any FS with -inode_operations ->permission callback
returning -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this
problem to happen. (few e.g. found were - ocfs2, ceph, coda, afs)
int xxx_permission(struct inode *inode, int mask)
{
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
<...>
}
Wugyuan(cc), could reproduce this problem with GPFS filesystem.
Since, I didn't have the GPFS setup, so I tried replicating on a native
FS by forcing out-of-order execution in function
__d_set_inode_and_type() and making sure we return -ECHILD in
MAY_NOT_BLOCK case in ->permission operation for all inodes.
With above changes in kernel, I could as well hit this issue on a native
FS too.
(basically what we observed is link_path_walk will do nonRCU(REF-walk)
lookup due to may_lookup -> inode_permission return -ECHILD and then
unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race
is possible).
Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)
dentry = __d_lookup() //nonRCU
__d_set_and_inode_type() (Out-of-order execution)
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
WRITE_ONCE(dentry->d_flags, flags);
if (unlikely(d_is_negative()) // fails
{}
// since type is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false
*inode = d_backing_inode(path->dentry);
// means inode is still NULL
dentry->d_inode = inode;
trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid
Approach-1:- using wmb()
diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..966172df77ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct
dentry *dentry,
unsigned flags;
dentry->d_inode = inode;
+ smp_wmb();
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
Approach-2:- using spin_lock(&dentry->d_lock);
Do you think lock should be a better approach, given that we are already
in REF-walk mode. As per the Documentation, we should be able to take
spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?
With smp_wmb(), if added, should add a small latency in both
RCU/REF-walk. But should be able to cover all the cases in general
related to dentry->type & dentry->inode ordering. Though, we don't have
any other race conditions reported or tested, other than the one,
mentioned in this email.
Confused :(
diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a3145594da1c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd,
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
int err;
+ bool negative;
/*
* Rename seqlock is not required here because in the off chance
@@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd,
*/
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
- bool negative;
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
if (unlikely(!dentry)) {
if (unlazy_walk(nd))
@@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
- if (unlikely(d_is_negative(dentry))) {
+
+ spin_lock(&dentry->d_lock);
+ negative = d_is_negative(dentry);
+ spin_unlock(&dentry->d_lock);
+ if (unlikely(negative)) {
dput(dentry);
return -ENOENT;
}
Regards
Ritesh
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
2019-09-03 11:58 [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink Ritesh Harjani
@ 2019-09-03 12:59 ` Gao Xiang
2019-09-03 13:41 ` Ritesh Harjani
0 siblings, 1 reply; 6+ messages in thread
From: Gao Xiang @ 2019-09-03 12:59 UTC (permalink / raw)
To: Ritesh Harjani
Cc: linux-fsdevel, Alexander Viro, aneesh.kumar, Jeff Layton, wugyuan
On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> Hi Viro/All,
>
> Could you please review below issue and it's proposed solutions.
> If you could let me know which of the two you think will be a better
> approach to solve this or in case if you have any other better approach, I
> can prepare and submit a official patch with that.
>
>
>
> Issue signature:-
> [NIP : trailing_symlink+80]
> [LR : trailing_symlink+1092]
> #4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
> #5 [c00000198069bc00] path_openat at c0000000004bdd14
> #6 [c00000198069bc90] do_filp_open at c0000000004c0274
> #7 [c00000198069bdb0] do_sys_open at c00000000049b248
> #8 [c00000198069be30] system_call at c00000000000b388
>
>
>
> Test case:-
> shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
>
>
>
> Problem description:-
> In some filesystems like GPFS below described scenario may happen on some
> platforms (Reported-By:- wugyuan)
>
> Here, two threads are being run in 2 different shells. Thread-1(cat) does
> cat of the symlink and Thread-2(ln) is creating the symlink.
>
> Now on any platform with GPFS like filesystem, if CPU does out-of-order
> execution (or any kind of re-ordering due compiler optimization?) in
> function __d_set_and_inode_type(), then we see a NULL pointer dereference
> due to inode->i_uid.
>
> This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> else condition), we check d_is_negative() without any lock protection.
> And since in __d_set_and_inode_type() re-ordering may happen in setting of
> dentry->type & dentry->inode => this means that there is this tiny window
> where things are going wrong.
>
>
> (GPFS like):- Any FS with -inode_operations ->permission callback returning
> -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> (few e.g. found were - ocfs2, ceph, coda, afs)
>
> int xxx_permission(struct inode *inode, int mask)
> {
> if (mask & MAY_NOT_BLOCK)
> return -ECHILD;
> <...>
> }
>
> Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> by forcing out-of-order execution in function __d_set_inode_and_type() and
> making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> operation for all inodes.
>
> With above changes in kernel, I could as well hit this issue on a native FS
> too.
>
> (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> lookup due to may_lookup -> inode_permission return -ECHILD and then
> unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> possible).
>
>
>
> Sequence of events:-
>
> Thread-2(Comm: ln) Thread-1(Comm: cat)
>
> dentry = __d_lookup() //nonRCU
>
> __d_set_and_inode_type() (Out-of-order execution)
> flags = READ_ONCE(dentry->d_flags);
> flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> flags |= type_flags;
> WRITE_ONCE(dentry->d_flags, flags);
>
>
> if (unlikely(d_is_negative()) // fails
> {}
> // since type is already updated in
> // Thread-2 in parallel but inode
> // not yet set.
> // d_is_negative returns false
>
> *inode = d_backing_inode(path->dentry);
> // means inode is still NULL
>
> dentry->d_inode = inode;
>
> trailing_symlink()
> may_follow_link()
> inode = nd->link_inode;
> // nd->link_inode = NULL
> //Then it crashes while
> //doing inode->i_uid
>
>
It seems much similar to
https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
Thanks,
Gao Xiang
>
>
>
> Approach-1:- using wmb()
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e88cf0554e65..966172df77ee 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct dentry
> *dentry,
> unsigned flags;
>
> dentry->d_inode = inode;
> + smp_wmb();
> flags = READ_ONCE(dentry->d_flags);
> flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> flags |= type_flags;
>
>
>
> Approach-2:- using spin_lock(&dentry->d_lock);
>
> Do you think lock should be a better approach, given that we are already
> in REF-walk mode. As per the Documentation, we should be able to take
> spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?
>
>
> With smp_wmb(), if added, should add a small latency in both
> RCU/REF-walk. But should be able to cover all the cases in general related
> to dentry->type & dentry->inode ordering. Though, we don't have any other
> race conditions reported or tested, other than the one, mentioned in this
> email.
>
> Confused :(
>
>
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..a3145594da1c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd,
> struct dentry *dentry, *parent = nd->path.dentry;
> int status = 1;
> int err;
> + bool negative;
>
> /*
> * Rename seqlock is not required here because in the off chance
> @@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd,
> */
> if (nd->flags & LOOKUP_RCU) {
> unsigned seq;
> - bool negative;
> dentry = __d_lookup_rcu(parent, &nd->last, &seq);
> if (unlikely(!dentry)) {
> if (unlazy_walk(nd))
> @@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd,
> dput(dentry);
> return status;
> }
> - if (unlikely(d_is_negative(dentry))) {
> +
> + spin_lock(&dentry->d_lock);
> + negative = d_is_negative(dentry);
> + spin_unlock(&dentry->d_lock);
> + if (unlikely(negative)) {
> dput(dentry);
> return -ENOENT;
> }
>
>
> Regards
> Ritesh
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
2019-09-03 12:59 ` Gao Xiang
@ 2019-09-03 13:41 ` Ritesh Harjani
2019-09-03 13:58 ` Gao Xiang
2019-09-04 14:39 ` Jeff Layton
0 siblings, 2 replies; 6+ messages in thread
From: Ritesh Harjani @ 2019-09-03 13:41 UTC (permalink / raw)
To: Gao Xiang, Alexander Viro
Cc: linux-fsdevel, aneesh.kumar, Jeff Layton, wugyuan
On 9/3/19 6:29 PM, Gao Xiang wrote:
> On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
>> Hi Viro/All,
>>
>> Could you please review below issue and it's proposed solutions.
>> If you could let me know which of the two you think will be a better
>> approach to solve this or in case if you have any other better approach, I
>> can prepare and submit a official patch with that.
>>
>>
>>
>> Issue signature:-
>> [NIP : trailing_symlink+80]
>> [LR : trailing_symlink+1092]
>> #4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
>> #5 [c00000198069bc00] path_openat at c0000000004bdd14
>> #6 [c00000198069bc90] do_filp_open at c0000000004c0274
>> #7 [c00000198069bdb0] do_sys_open at c00000000049b248
>> #8 [c00000198069be30] system_call at c00000000000b388
>>
>>
>>
>> Test case:-
>> shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
>> shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
>> /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
>>
>>
>>
>> Problem description:-
>> In some filesystems like GPFS below described scenario may happen on some
>> platforms (Reported-By:- wugyuan)
>>
>> Here, two threads are being run in 2 different shells. Thread-1(cat) does
>> cat of the symlink and Thread-2(ln) is creating the symlink.
>>
>> Now on any platform with GPFS like filesystem, if CPU does out-of-order
>> execution (or any kind of re-ordering due compiler optimization?) in
>> function __d_set_and_inode_type(), then we see a NULL pointer dereference
>> due to inode->i_uid.
>>
>> This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
>> else condition), we check d_is_negative() without any lock protection.
>> And since in __d_set_and_inode_type() re-ordering may happen in setting of
>> dentry->type & dentry->inode => this means that there is this tiny window
>> where things are going wrong.
>>
>>
>> (GPFS like):- Any FS with -inode_operations ->permission callback returning
>> -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
>> (few e.g. found were - ocfs2, ceph, coda, afs)
>>
>> int xxx_permission(struct inode *inode, int mask)
>> {
>> if (mask & MAY_NOT_BLOCK)
>> return -ECHILD;
>> <...>
>> }
>>
>> Wugyuan(cc), could reproduce this problem with GPFS filesystem.
>> Since, I didn't have the GPFS setup, so I tried replicating on a native FS
>> by forcing out-of-order execution in function __d_set_inode_and_type() and
>> making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
>> operation for all inodes.
>>
>> With above changes in kernel, I could as well hit this issue on a native FS
>> too.
>>
>> (basically what we observed is link_path_walk will do nonRCU(REF-walk)
>> lookup due to may_lookup -> inode_permission return -ECHILD and then
>> unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
>> possible).
>>
>>
>>
>> Sequence of events:-
>>
>> Thread-2(Comm: ln) Thread-1(Comm: cat)
>>
>> dentry = __d_lookup() //nonRCU
>>
>> __d_set_and_inode_type() (Out-of-order execution)
>> flags = READ_ONCE(dentry->d_flags);
>> flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>> flags |= type_flags;
>> WRITE_ONCE(dentry->d_flags, flags);
>>
>>
>> if (unlikely(d_is_negative()) // fails
>> {}
>> // since type is already updated in
>> // Thread-2 in parallel but inode
>> // not yet set.
>> // d_is_negative returns false
>>
>> *inode = d_backing_inode(path->dentry);
>> // means inode is still NULL
>>
>> dentry->d_inode = inode;
>>
>> trailing_symlink()
>> may_follow_link()
>> inode = nd->link_inode;
>> // nd->link_inode = NULL
>> //Then it crashes while
>> //doing inode->i_uid
>>
>>
>
> It seems much similar to
> https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
Thanks, yes two same symptoms with different use cases.
But except the fact that here, we see the issue with GPFS quite
frequently. So let's hope that we could have some solution to this
problem in upstream.
From the thread:-
>> We could simply use d_really_is_negative() there, avoiding all that
>> mess. If and when we get around to whiteouts-in-dcache (i.e. if
>> unionfs series gets resurrected), we can revisit that
I didn't get this part. Does it mean, d_really_is_negative can only be
used, once whiteouts-in-dcache series is resurrected?
If yes, meanwhile could we have any other solution in place?
-ritesh
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
2019-09-03 13:41 ` Ritesh Harjani
@ 2019-09-03 13:58 ` Gao Xiang
2019-09-04 14:39 ` Jeff Layton
1 sibling, 0 replies; 6+ messages in thread
From: Gao Xiang @ 2019-09-03 13:58 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Alexander Viro, linux-fsdevel, aneesh.kumar, Jeff Layton, wugyuan
On Tue, Sep 03, 2019 at 07:11:28PM +0530, Ritesh Harjani wrote:
>
>
> On 9/3/19 6:29 PM, Gao Xiang wrote:
> > On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> > > Hi Viro/All,
> > >
> > > Could you please review below issue and it's proposed solutions.
> > > If you could let me know which of the two you think will be a better
> > > approach to solve this or in case if you have any other better approach, I
> > > can prepare and submit a official patch with that.
> > >
> > >
> > >
> > > Issue signature:-
> > > [NIP : trailing_symlink+80]
> > > [LR : trailing_symlink+1092]
> > > #4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
> > > #5 [c00000198069bc00] path_openat at c0000000004bdd14
> > > #6 [c00000198069bc90] do_filp_open at c0000000004c0274
> > > #7 [c00000198069bdb0] do_sys_open at c00000000049b248
> > > #8 [c00000198069be30] system_call at c00000000000b388
> > >
> > >
> > >
> > > Test case:-
> > > shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> > > shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> > > /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
> > >
> > >
> > >
> > > Problem description:-
> > > In some filesystems like GPFS below described scenario may happen on some
> > > platforms (Reported-By:- wugyuan)
> > >
> > > Here, two threads are being run in 2 different shells. Thread-1(cat) does
> > > cat of the symlink and Thread-2(ln) is creating the symlink.
> > >
> > > Now on any platform with GPFS like filesystem, if CPU does out-of-order
> > > execution (or any kind of re-ordering due compiler optimization?) in
> > > function __d_set_and_inode_type(), then we see a NULL pointer dereference
> > > due to inode->i_uid.
> > >
> > > This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> > > else condition), we check d_is_negative() without any lock protection.
> > > And since in __d_set_and_inode_type() re-ordering may happen in setting of
> > > dentry->type & dentry->inode => this means that there is this tiny window
> > > where things are going wrong.
> > >
> > >
> > > (GPFS like):- Any FS with -inode_operations ->permission callback returning
> > > -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> > > (few e.g. found were - ocfs2, ceph, coda, afs)
> > >
> > > int xxx_permission(struct inode *inode, int mask)
> > > {
> > > if (mask & MAY_NOT_BLOCK)
> > > return -ECHILD;
> > > <...>
> > > }
> > >
> > > Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> > > Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> > > by forcing out-of-order execution in function __d_set_inode_and_type() and
> > > making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> > > operation for all inodes.
> > >
> > > With above changes in kernel, I could as well hit this issue on a native FS
> > > too.
> > >
> > > (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> > > lookup due to may_lookup -> inode_permission return -ECHILD and then
> > > unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> > > possible).
> > >
> > >
> > >
> > > Sequence of events:-
> > >
> > > Thread-2(Comm: ln) Thread-1(Comm: cat)
> > >
> > > dentry = __d_lookup() //nonRCU
> > >
> > > __d_set_and_inode_type() (Out-of-order execution)
> > > flags = READ_ONCE(dentry->d_flags);
> > > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> > > flags |= type_flags;
> > > WRITE_ONCE(dentry->d_flags, flags);
> > >
> > >
> > > if (unlikely(d_is_negative()) // fails
> > > {}
> > > // since type is already updated in
> > > // Thread-2 in parallel but inode
> > > // not yet set.
> > > // d_is_negative returns false
> > >
> > > *inode = d_backing_inode(path->dentry);
> > > // means inode is still NULL
> > >
> > > dentry->d_inode = inode;
> > >
> > > trailing_symlink()
> > > may_follow_link()
> > > inode = nd->link_inode;
> > > // nd->link_inode = NULL
> > > //Then it crashes while
> > > //doing inode->i_uid
> > >
> > >
> >
> > It seems much similar to
> > https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
>
> Thanks, yes two same symptoms with different use cases.
> But except the fact that here, we see the issue with GPFS quite frequently.
> So let's hope that we could have some solution to this problem in upstream.
>
> From the thread:-
> >> We could simply use d_really_is_negative() there, avoiding all that
> >> mess. If and when we get around to whiteouts-in-dcache (i.e. if
> >> unionfs series gets resurrected), we can revisit that
>
> I didn't get this part. Does it mean, d_really_is_negative can only be used,
> once whiteouts-in-dcache series is resurrected?
> If yes, meanwhile could we have any other solution in place?
In my own premature opinion, I think it's some complicated about
the coexistence of d_is_negative() and d_really_is_negative(),
and handle both d_flags and d_inode stuffs for negative dentries...
No constructive idea here... Just same case found by our colleagues...
Thanks,
Gao Xiang
>
> -ritesh
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
2019-09-03 13:41 ` Ritesh Harjani
2019-09-03 13:58 ` Gao Xiang
@ 2019-09-04 14:39 ` Jeff Layton
2019-09-06 5:17 ` Ritesh Harjani
1 sibling, 1 reply; 6+ messages in thread
From: Jeff Layton @ 2019-09-04 14:39 UTC (permalink / raw)
To: Ritesh Harjani, Gao Xiang, Alexander Viro
Cc: linux-fsdevel, aneesh.kumar, wugyuan
On Tue, 2019-09-03 at 19:11 +0530, Ritesh Harjani wrote:
>
> On 9/3/19 6:29 PM, Gao Xiang wrote:
> > On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> > > Hi Viro/All,
> > >
> > > Could you please review below issue and it's proposed solutions.
> > > If you could let me know which of the two you think will be a better
> > > approach to solve this or in case if you have any other better approach, I
> > > can prepare and submit a official patch with that.
> > >
> > >
> > >
> > > Issue signature:-
> > > [NIP : trailing_symlink+80]
> > > [LR : trailing_symlink+1092]
> > > #4 [c00000198069bb70] trailing_symlink at c0000000004bae60 (unreliable)
> > > #5 [c00000198069bc00] path_openat at c0000000004bdd14
> > > #6 [c00000198069bc90] do_filp_open at c0000000004c0274
> > > #7 [c00000198069bdb0] do_sys_open at c00000000049b248
> > > #8 [c00000198069be30] system_call at c00000000000b388
> > >
> > >
> > >
> > > Test case:-
> > > shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> > > shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> > > /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
> > >
> > >
> > >
> > > Problem description:-
> > > In some filesystems like GPFS below described scenario may happen on some
> > > platforms (Reported-By:- wugyuan)
> > >
> > > Here, two threads are being run in 2 different shells. Thread-1(cat) does
> > > cat of the symlink and Thread-2(ln) is creating the symlink.
> > >
> > > Now on any platform with GPFS like filesystem, if CPU does out-of-order
> > > execution (or any kind of re-ordering due compiler optimization?) in
> > > function __d_set_and_inode_type(), then we see a NULL pointer dereference
> > > due to inode->i_uid.
> > >
> > > This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> > > else condition), we check d_is_negative() without any lock protection.
> > > And since in __d_set_and_inode_type() re-ordering may happen in setting of
> > > dentry->type & dentry->inode => this means that there is this tiny window
> > > where things are going wrong.
> > >
> > >
> > > (GPFS like):- Any FS with -inode_operations ->permission callback returning
> > > -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> > > (few e.g. found were - ocfs2, ceph, coda, afs)
> > >
> > > int xxx_permission(struct inode *inode, int mask)
> > > {
> > > if (mask & MAY_NOT_BLOCK)
> > > return -ECHILD;
> > > <...>
> > > }
> > >
> > > Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> > > Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> > > by forcing out-of-order execution in function __d_set_inode_and_type() and
> > > making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> > > operation for all inodes.
> > >
> > > With above changes in kernel, I could as well hit this issue on a native FS
> > > too.
> > >
> > > (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> > > lookup due to may_lookup -> inode_permission return -ECHILD and then
> > > unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> > > possible).
> > >
> > >
> > >
> > > Sequence of events:-
> > >
> > > Thread-2(Comm: ln) Thread-1(Comm: cat)
> > >
> > > dentry = __d_lookup() //nonRCU
> > >
> > > __d_set_and_inode_type() (Out-of-order execution)
> > > flags = READ_ONCE(dentry->d_flags);
> > > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> > > flags |= type_flags;
> > > WRITE_ONCE(dentry->d_flags, flags);
> > >
> > >
> > > if (unlikely(d_is_negative()) // fails
> > > {}
> > > // since type is already updated in
> > > // Thread-2 in parallel but inode
> > > // not yet set.
> > > // d_is_negative returns false
> > >
> > > *inode = d_backing_inode(path->dentry);
> > > // means inode is still NULL
> > >
> > > dentry->d_inode = inode;
> > >
> > > trailing_symlink()
> > > may_follow_link()
> > > inode = nd->link_inode;
> > > // nd->link_inode = NULL
> > > //Then it crashes while
> > > //doing inode->i_uid
> > >
> > >
> >
> > It seems much similar to
> > https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
>
> Thanks, yes two same symptoms with different use cases.
> But except the fact that here, we see the issue with GPFS quite
> frequently. So let's hope that we could have some solution to this
> problem in upstream.
>
Agreed. Looks a lot like the same issue.
> From the thread:-
> >> We could simply use d_really_is_negative() there, avoiding all that
> >> mess. If and when we get around to whiteouts-in-dcache (i.e. if
> >> unionfs series gets resurrected), we can revisit that
>
> I didn't get this part. Does it mean, d_really_is_negative can only be
> used, once whiteouts-in-dcache series is resurrected?
> If yes, meanwhile could we have any other solution in place?
>
I think Al was saying that you could change this to use
d_really_is_negative now but the whiteouts-in-dcache series would have
to deal with that. That series seems to be stalled for the time being,
so I wouldn't let it stop you from changing this.
It wouldn't hurt to put in some comments with this change too, to make
sure everyone understands why that's being used there.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
2019-09-04 14:39 ` Jeff Layton
@ 2019-09-06 5:17 ` Ritesh Harjani
0 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani @ 2019-09-06 5:17 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro
Cc: Gao Xiang, linux-fsdevel, aneesh.kumar, wugyuan
On 9/4/19 8:09 PM, Jeff Layton wrote:
>>> It seems much similar to
>>> https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
>>
>> Thanks, yes two same symptoms with different use cases.
>> But except the fact that here, we see the issue with GPFS quite
>> frequently. So let's hope that we could have some solution to this
>> problem in upstream.
>>
>
> Agreed. Looks a lot like the same issue.
>
>> From the thread:-
>> >> We could simply use d_really_is_negative() there, avoiding all that
>> >> mess. If and when we get around to whiteouts-in-dcache (i.e. if
>> >> unionfs series gets resurrected), we can revisit that
>>
>> I didn't get this part. Does it mean, d_really_is_negative can only be
>> used, once whiteouts-in-dcache series is resurrected?
>> If yes, meanwhile could we have any other solution in place?
>>
>
> I think Al was saying that you could change this to use
> d_really_is_negative now but the whiteouts-in-dcache series would have
> to deal with that. That series seems to be stalled for the time being,
> so I wouldn't let it stop you from changing this.
>
> It wouldn't hurt to put in some comments with this change too, to make
> sure everyone understands why that's being used there.
Thanks Jeff. I did look at the thread again and yes you are right.
I have created the patch and added some comments as per your suggestion.
Once it is tested again internally (although it does looks like it
should work fine), I will post the patch.
Appreciate your help!!
-ritesh
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2019-09-06 5:17 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-09-03 11:58 [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink Ritesh Harjani
2019-09-03 12:59 ` Gao Xiang
2019-09-03 13:41 ` Ritesh Harjani
2019-09-03 13:58 ` Gao Xiang
2019-09-04 14:39 ` Jeff Layton
2019-09-06 5:17 ` Ritesh Harjani
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).