Is lockref_get_not_zero() really correct in dget

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Is lockref_get_not_zero() really correct in dget_parent()
@ 2014-08-05  3:17 NeilBrown
  2014-08-05  4:07 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2014-08-05  3:17 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, linux-fsdevel, lkml

[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]

Hi,
 I've been looking at last year's change to dentry refcounting which sets the
 refcount to -128 (mark_dead()) when the dentry is gone.

 As this is an "unsigned long" and there are several places where
 d_lockref.count is compared e.g. "> 1", I start feeling uncomfortable, as
 "-128" is greater than "1".

 Most of them look safe because there is locking in place and
 d_lockref.count will never be seen as "-128" unless you get the reference
 with only rcu_read_lock().

 That brings me to dget_parent().  It only has rcu_read_lock() protection, and
 yet uses lockref_get_not_zero().  This doesn't seem safe.

 Is there a reason that it is safe that I'm not seeing?  Or is the following
 needed?

Thanks,
NeilBrown

Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/fs/dcache.c b/fs/dcache.c
index 06f65857a855..c48ce95a38f2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -699,7 +699,7 @@ struct dentry *dget_parent(struct dentry *dentry)
 	 */
 	rcu_read_lock();
 	ret = ACCESS_ONCE(dentry->d_parent);
-	gotref = lockref_get_not_zero(&ret->d_lockref);
+	gotref = lockref_get_not_dead(&ret->d_lockref);
 	rcu_read_unlock();
 	if (likely(gotref)) {
 		if (likely(ret == ACCESS_ONCE(dentry->d_parent)))

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  3:17 Is lockref_get_not_zero() really correct in dget_parent() NeilBrown
@ 2014-08-05  4:07 ` Linus Torvalds
  2014-08-05  4:54   ` Steven Noonan
  2014-08-05  5:14   ` Al Viro
  0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2014-08-05  4:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: Al Viro, linux-fsdevel, lkml

On Mon, Aug 4, 2014 at 8:17 PM, NeilBrown <neilb@suse.de> wrote:
>
>  I've been looking at last year's change to dentry refcounting which sets the
>  refcount to -128 (mark_dead()) when the dentry is gone.
>
>  As this is an "unsigned long" and there are several places where
>  d_lockref.count is compared e.g. "> 1", I start feeling uncomfortable, as
>  "-128" is greater than "1".

Anybody who checks the lockref count without holding the lock is
pretty much buggy by definition. And if you hold the lock, you had
better not ever see a negative (== large positive) number, because
that would be all kinds of buggy too.

So I don't *think* that people who compare with "> 1" kind of things
should be problematic. I wouldn't necessarily disagree with the notion
of making a lockref be a signed entity, though. It started out
unsigned, but it started out without that dead state too, so that
unsigned thing can be considered a historical artifact rather than any
real design decision.

Anyway, I think my argument is that anybody who actually looks at
d_count() and might see that magic dead value is so fundamentally
broken in other ways that I wouldn't worry too much about *that* part.

But your "lockref_get_not_zero()" thing is a different thing:

>  That brings me to dget_parent().  It only has rcu_read_lock() protection, and
>  yet uses lockref_get_not_zero().  This doesn't seem safe.

Yes, agreed, it's ugly and wrong, and smells bad.

But I think it happens to be safe (because the re-checking of d_parent
will fail if a rename and dput could have triggered it, and even the
extraneous "dput()" is actually safe, because it won't cause the value
to become zero, so nothing bad happens. But it *is* kind of subtle,
and I do agree that it's *needlessly* so.

So it might be a good idea to get rid of the "not zero" version
entirely, and make the check be about being *active* (ie not zero, and
not dead).

The only user of lockref_get_not_zero() is that dget_parent() thing,
so that should be easy.

So renaming it to "lockref_get_active()", and changing the "not zero"
test to check for "positive" and change the rtype of "count" to be
signed, all sound like good things to me.

But I don't actually think it's an active bug, it's just an "active
horribly ugly and subtly working code". I guess in theory if you can
get lots of CPU's triggering the race at the same time, the magic
negative number could become zero and positive, but at that point I
don't think we're really talking reality any more.

Can somebody pick holes in that? Does somebody want to send in the
cleanup patch?

          Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  4:07 ` Linus Torvalds
@ 2014-08-05  4:54   ` Steven Noonan
  2014-08-05  4:56     ` Steven Noonan
  2014-08-05  5:14   ` Al Viro
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Noonan @ 2014-08-05  4:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: NeilBrown, Al Viro, linux-fsdevel, lkml

On Mon, Aug 04, 2014 at 09:07:32PM -0700, Linus Torvalds wrote:
> On Mon, Aug 4, 2014 at 8:17 PM, NeilBrown <neilb@suse.de> wrote:
> >
> >  I've been looking at last year's change to dentry refcounting which sets the
> >  refcount to -128 (mark_dead()) when the dentry is gone.
> >
> >  As this is an "unsigned long" and there are several places where
> >  d_lockref.count is compared e.g. "> 1", I start feeling uncomfortable, as
> >  "-128" is greater than "1".
> 
> Anybody who checks the lockref count without holding the lock is
> pretty much buggy by definition. And if you hold the lock, you had
> better not ever see a negative (== large positive) number, because
> that would be all kinds of buggy too.
> 
> So I don't *think* that people who compare with "> 1" kind of things
> should be problematic. I wouldn't necessarily disagree with the notion
> of making a lockref be a signed entity, though. It started out
> unsigned, but it started out without that dead state too, so that
> unsigned thing can be considered a historical artifact rather than any
> real design decision.
> 
> Anyway, I think my argument is that anybody who actually looks at
> d_count() and might see that magic dead value is so fundamentally
> broken in other ways that I wouldn't worry too much about *that* part.
> 
> But your "lockref_get_not_zero()" thing is a different thing:
> 
> >  That brings me to dget_parent().  It only has rcu_read_lock() protection, and
> >  yet uses lockref_get_not_zero().  This doesn't seem safe.
> 
> Yes, agreed, it's ugly and wrong, and smells bad.
> 
> But I think it happens to be safe (because the re-checking of d_parent
> will fail if a rename and dput could have triggered it, and even the
> extraneous "dput()" is actually safe, because it won't cause the value
> to become zero, so nothing bad happens. But it *is* kind of subtle,
> and I do agree that it's *needlessly* so.
> 
> So it might be a good idea to get rid of the "not zero" version
> entirely, and make the check be about being *active* (ie not zero, and
> not dead).
> 
> The only user of lockref_get_not_zero() is that dget_parent() thing,
> so that should be easy.
> 
> So renaming it to "lockref_get_active()", and changing the "not zero"
> test to check for "positive" and change the rtype of "count" to be
> signed, all sound like good things to me.
> 
> But I don't actually think it's an active bug, it's just an "active
> horribly ugly and subtly working code". I guess in theory if you can
> get lots of CPU's triggering the race at the same time, the magic
> negative number could become zero and positive, but at that point I
> don't think we're really talking reality any more.
> 
> Can somebody pick holes in that? Does somebody want to send in the
> cleanup patch?

How does this look?


diff --git a/fs/dcache.c b/fs/dcache.c
index e99c6f5..1e7dc31 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -479,7 +479,7 @@ static void __dentry_kill(struct dentry *dentry)
 	 * dentry_iput drops the locks, at which point nobody (except
 	 * transient RCU lookups) can reach this dentry.
 	 */
-	BUG_ON((int)dentry->d_lockref.count > 0);
+	BUG_ON(dentry->d_lockref.count > 0);
 	this_cpu_dec(nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
@@ -532,7 +532,7 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	struct dentry *parent = dentry->d_parent;
 	if (IS_ROOT(dentry))
 		return NULL;
-	if (unlikely((int)dentry->d_lockref.count < 0))
+	if (unlikely(dentry->d_lockref.count < 0))
 		return NULL;
 	if (likely(spin_trylock(&parent->d_lock)))
 		return parent;
@@ -699,7 +699,7 @@ struct dentry *dget_parent(struct dentry *dentry)
 	 */
 	rcu_read_lock();
 	ret = ACCESS_ONCE(dentry->d_parent);
-	gotref = lockref_get_not_zero(&ret->d_lockref);
+	gotref = lockref_get_active(&ret->d_lockref);
 	rcu_read_unlock();
 	if (likely(gotref)) {
 		if (likely(ret == ACCESS_ONCE(dentry->d_parent)))
@@ -848,7 +848,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup. Do not free it.
 		 */
-		if ((int)dentry->d_lockref.count > 0) {
+		if (dentry->d_lockref.count > 0) {
 			spin_unlock(&dentry->d_lock);
 			if (parent)
 				spin_unlock(&parent->d_lock);
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index aec7f73..d492f0e 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1745,7 +1745,7 @@ void gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl)
 		  state2str(gl->gl_demote_state), dtime,
 		  atomic_read(&gl->gl_ail_count),
 		  atomic_read(&gl->gl_revokes),
-		  (int)gl->gl_lockref.count, gl->gl_hold_time);
+		  gl->gl_lockref.count, gl->gl_hold_time);
 
 	list_for_each_entry(gh, &gl->gl_holders, gh_list)
 		dump_holder(seq, gh);
diff --git a/include/linux/lockref.h b/include/linux/lockref.h
index 4bfde0e..1a9827e 100644
--- a/include/linux/lockref.h
+++ b/include/linux/lockref.h
@@ -28,13 +28,13 @@ struct lockref {
 #endif
 		struct {
 			spinlock_t lock;
-			unsigned int count;
+			int count;
 		};
 	};
 };
 
 extern void lockref_get(struct lockref *);
-extern int lockref_get_not_zero(struct lockref *);
+extern int lockref_get_active(struct lockref *);
 extern int lockref_get_or_lock(struct lockref *);
 extern int lockref_put_or_lock(struct lockref *);
 
diff --git a/lib/lockref.c b/lib/lockref.c
index f07a40d..7f30371 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -61,17 +61,18 @@ void lockref_get(struct lockref *lockref)
 EXPORT_SYMBOL(lockref_get);
 
 /**
- * lockref_get_not_zero - Increments count unless the count is 0
+ * lockref_get_active - Increments count unless the count is 0 or ref is dead
  * @lockref: pointer to lockref structure
- * Return: 1 if count updated successfully or 0 if count was zero
+ * Return: 1 if count updated successfully or 0 if count was zero or lockref
+ * was dead
  */
-int lockref_get_not_zero(struct lockref *lockref)
+int lockref_get_active(struct lockref *lockref)
 {
 	int retval;
 
 	CMPXCHG_LOOP(
 		new.count++;
-		if (!old.count)
+		if (old.count < 1)
 			return 0;
 	,
 		return 1;
@@ -79,14 +80,14 @@ int lockref_get_not_zero(struct lockref *lockref)
 
 	spin_lock(&lockref->lock);
 	retval = 0;
-	if (lockref->count) {
+	if (lockref->count >= 1) {
 		lockref->count++;
 		retval = 1;
 	}
 	spin_unlock(&lockref->lock);
 	return retval;
 }
-EXPORT_SYMBOL(lockref_get_not_zero);
+EXPORT_SYMBOL(lockref_get_active);
 
 /**
  * lockref_get_or_lock - Increments count unless the count is 0
@@ -159,7 +160,7 @@ int lockref_get_not_dead(struct lockref *lockref)
 
 	CMPXCHG_LOOP(
 		new.count++;
-		if ((int)old.count < 0)
+		if (old.count < 0)
 			return 0;
 	,
 		return 1;
@@ -167,7 +168,7 @@ int lockref_get_not_dead(struct lockref *lockref)
 
 	spin_lock(&lockref->lock);
 	retval = 0;
-	if ((int) lockref->count >= 0) {
+	if (lockref->count >= 0) {
 		lockref->count++;
 		retval = 1;
 	}


I'm not too happy with the documentation string for lockref_get_active(), but
I believe the logic is at least right. There were a few places where 'count'
was being casted to a signed integer, and since this change turns 'count' into
a signed value anyway, I've stripped out the casts.

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  4:54   ` Steven Noonan
@ 2014-08-05  4:56     ` Steven Noonan
  2014-08-05 16:53       ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Noonan @ 2014-08-05  4:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: NeilBrown, Al Viro, linux-fsdevel, lkml

On Mon, Aug 4, 2014 at 9:54 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> On Mon, Aug 04, 2014 at 09:07:32PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 4, 2014 at 8:17 PM, NeilBrown <neilb@suse.de> wrote:
>> >
>> >  I've been looking at last year's change to dentry refcounting which sets the
>> >  refcount to -128 (mark_dead()) when the dentry is gone.
>> >
>> >  As this is an "unsigned long" and there are several places where
>> >  d_lockref.count is compared e.g. "> 1", I start feeling uncomfortable, as
>> >  "-128" is greater than "1".
>>
>> Anybody who checks the lockref count without holding the lock is
>> pretty much buggy by definition. And if you hold the lock, you had
>> better not ever see a negative (== large positive) number, because
>> that would be all kinds of buggy too.
>>
>> So I don't *think* that people who compare with "> 1" kind of things
>> should be problematic. I wouldn't necessarily disagree with the notion
>> of making a lockref be a signed entity, though. It started out
>> unsigned, but it started out without that dead state too, so that
>> unsigned thing can be considered a historical artifact rather than any
>> real design decision.
>>
>> Anyway, I think my argument is that anybody who actually looks at
>> d_count() and might see that magic dead value is so fundamentally
>> broken in other ways that I wouldn't worry too much about *that* part.
>>
>> But your "lockref_get_not_zero()" thing is a different thing:
>>
>> >  That brings me to dget_parent().  It only has rcu_read_lock() protection, and
>> >  yet uses lockref_get_not_zero().  This doesn't seem safe.
>>
>> Yes, agreed, it's ugly and wrong, and smells bad.
>>
>> But I think it happens to be safe (because the re-checking of d_parent
>> will fail if a rename and dput could have triggered it, and even the
>> extraneous "dput()" is actually safe, because it won't cause the value
>> to become zero, so nothing bad happens. But it *is* kind of subtle,
>> and I do agree that it's *needlessly* so.
>>
>> So it might be a good idea to get rid of the "not zero" version
>> entirely, and make the check be about being *active* (ie not zero, and
>> not dead).
>>
>> The only user of lockref_get_not_zero() is that dget_parent() thing,
>> so that should be easy.
>>
>> So renaming it to "lockref_get_active()", and changing the "not zero"
>> test to check for "positive" and change the rtype of "count" to be
>> signed, all sound like good things to me.
>>
>> But I don't actually think it's an active bug, it's just an "active
>> horribly ugly and subtly working code". I guess in theory if you can
>> get lots of CPU's triggering the race at the same time, the magic
>> negative number could become zero and positive, but at that point I
>> don't think we're really talking reality any more.
>>
>> Can somebody pick holes in that? Does somebody want to send in the
>> cleanup patch?
>
> How does this look?
>
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e99c6f5..1e7dc31 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -479,7 +479,7 @@ static void __dentry_kill(struct dentry *dentry)
>          * dentry_iput drops the locks, at which point nobody (except
>          * transient RCU lookups) can reach this dentry.
>          */
> -       BUG_ON((int)dentry->d_lockref.count > 0);
> +       BUG_ON(dentry->d_lockref.count > 0);
>         this_cpu_dec(nr_dentry);
>         if (dentry->d_op && dentry->d_op->d_release)
>                 dentry->d_op->d_release(dentry);
> @@ -532,7 +532,7 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
>         struct dentry *parent = dentry->d_parent;
>         if (IS_ROOT(dentry))
>                 return NULL;
> -       if (unlikely((int)dentry->d_lockref.count < 0))
> +       if (unlikely(dentry->d_lockref.count < 0))
>                 return NULL;
>         if (likely(spin_trylock(&parent->d_lock)))
>                 return parent;
> @@ -699,7 +699,7 @@ struct dentry *dget_parent(struct dentry *dentry)
>          */
>         rcu_read_lock();
>         ret = ACCESS_ONCE(dentry->d_parent);
> -       gotref = lockref_get_not_zero(&ret->d_lockref);
> +       gotref = lockref_get_active(&ret->d_lockref);
>         rcu_read_unlock();
>         if (likely(gotref)) {
>                 if (likely(ret == ACCESS_ONCE(dentry->d_parent)))
> @@ -848,7 +848,7 @@ static void shrink_dentry_list(struct list_head *list)
>                  * We found an inuse dentry which was not removed from
>                  * the LRU because of laziness during lookup. Do not free it.
>                  */
> -               if ((int)dentry->d_lockref.count > 0) {
> +               if (dentry->d_lockref.count > 0) {
>                         spin_unlock(&dentry->d_lock);
>                         if (parent)
>                                 spin_unlock(&parent->d_lock);
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index aec7f73..d492f0e 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -1745,7 +1745,7 @@ void gfs2_dump_glock(struct seq_file *seq, const struct gfs2_glock *gl)
>                   state2str(gl->gl_demote_state), dtime,
>                   atomic_read(&gl->gl_ail_count),
>                   atomic_read(&gl->gl_revokes),
> -                 (int)gl->gl_lockref.count, gl->gl_hold_time);
> +                 gl->gl_lockref.count, gl->gl_hold_time);
>
>         list_for_each_entry(gh, &gl->gl_holders, gh_list)
>                 dump_holder(seq, gh);

Er, grep failed me. The above delta to gfs2 is obviously bogus.

> diff --git a/include/linux/lockref.h b/include/linux/lockref.h
> index 4bfde0e..1a9827e 100644
> --- a/include/linux/lockref.h
> +++ b/include/linux/lockref.h
> @@ -28,13 +28,13 @@ struct lockref {
>  #endif
>                 struct {
>                         spinlock_t lock;
> -                       unsigned int count;
> +                       int count;
>                 };
>         };
>  };
>
>  extern void lockref_get(struct lockref *);
> -extern int lockref_get_not_zero(struct lockref *);
> +extern int lockref_get_active(struct lockref *);
>  extern int lockref_get_or_lock(struct lockref *);
>  extern int lockref_put_or_lock(struct lockref *);
>
> diff --git a/lib/lockref.c b/lib/lockref.c
> index f07a40d..7f30371 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -61,17 +61,18 @@ void lockref_get(struct lockref *lockref)
>  EXPORT_SYMBOL(lockref_get);
>
>  /**
> - * lockref_get_not_zero - Increments count unless the count is 0
> + * lockref_get_active - Increments count unless the count is 0 or ref is dead
>   * @lockref: pointer to lockref structure
> - * Return: 1 if count updated successfully or 0 if count was zero
> + * Return: 1 if count updated successfully or 0 if count was zero or lockref
> + * was dead
>   */
> -int lockref_get_not_zero(struct lockref *lockref)
> +int lockref_get_active(struct lockref *lockref)
>  {
>         int retval;
>
>         CMPXCHG_LOOP(
>                 new.count++;
> -               if (!old.count)
> +               if (old.count < 1)
>                         return 0;
>         ,
>                 return 1;
> @@ -79,14 +80,14 @@ int lockref_get_not_zero(struct lockref *lockref)
>
>         spin_lock(&lockref->lock);
>         retval = 0;
> -       if (lockref->count) {
> +       if (lockref->count >= 1) {
>                 lockref->count++;
>                 retval = 1;
>         }
>         spin_unlock(&lockref->lock);
>         return retval;
>  }
> -EXPORT_SYMBOL(lockref_get_not_zero);
> +EXPORT_SYMBOL(lockref_get_active);
>
>  /**
>   * lockref_get_or_lock - Increments count unless the count is 0
> @@ -159,7 +160,7 @@ int lockref_get_not_dead(struct lockref *lockref)
>
>         CMPXCHG_LOOP(
>                 new.count++;
> -               if ((int)old.count < 0)
> +               if (old.count < 0)
>                         return 0;
>         ,
>                 return 1;
> @@ -167,7 +168,7 @@ int lockref_get_not_dead(struct lockref *lockref)
>
>         spin_lock(&lockref->lock);
>         retval = 0;
> -       if ((int) lockref->count >= 0) {
> +       if (lockref->count >= 0) {
>                 lockref->count++;
>                 retval = 1;
>         }
>
>
> I'm not too happy with the documentation string for lockref_get_active(), but
> I believe the logic is at least right. There were a few places where 'count'
> was being casted to a signed integer, and since this change turns 'count' into
> a signed value anyway, I've stripped out the casts.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  4:56     ` Steven Noonan
@ 2014-08-05 16:53       ` Linus Torvalds
  0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2014-08-05 16:53 UTC (permalink / raw)
  To: Steven Noonan; +Cc: NeilBrown, Al Viro, linux-fsdevel, lkml

On Mon, Aug 4, 2014 at 9:56 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>
> Er, grep failed me. The above delta to gfs2 is obviously bogus.

Yeah, and looking at the patch, I think the "make lockref count
signed" should be a separate patch.

            Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  4:07 ` Linus Torvalds
  2014-08-05  4:54   ` Steven Noonan
@ 2014-08-05  5:14   ` Al Viro
  2014-08-05 22:58     ` Eric W. Biederman
  1 sibling, 1 reply; 7+ messages in thread
From: Al Viro @ 2014-08-05  5:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: NeilBrown, linux-fsdevel, lkml

On Mon, Aug 04, 2014 at 09:07:32PM -0700, Linus Torvalds wrote:

> So renaming it to "lockref_get_active()", and changing the "not zero"
> test to check for "positive" and change the rtype of "count" to be
> signed, all sound like good things to me.

Fine by me.  I can do that, unless somebody else beats me to that.

> But I don't actually think it's an active bug, it's just an "active
> horribly ugly and subtly working code". I guess in theory if you can
> get lots of CPU's triggering the race at the same time, the magic
> negative number could become zero and positive, but at that point I
> don't think we're really talking reality any more.
> 
> Can somebody pick holes in that? Does somebody want to send in the
> cleanup patch?

FWIW, dget_parent() has been a bloody annoyance - most of the callers used
to be racy and looking through the current set...  Take a look at e.g.
fs/btrfs/tree-log.c:check_parent_dirs_for_sync().  In the loop there we
do
                if (!parent || !parent->d_inode || sb != parent->d_inode->i_sb)
                        break;

                if (IS_ROOT(parent))
                        break;

                parent = dget_parent(parent);
                dput(old_parent);
                old_parent = parent;
                inode = parent->d_inode;
and it's so bogus it's not even funny.  What the hell is that code trying to
check?  And while we are at it, can we race with renames there?

>From my reading of that code, most of it ought to have been under
rcu_read_lock() and sod all that playing with refcounts.  Other
btrfs users are not much nicer (who says, for instance, that result of
check_parent_dirs_for_sync() will remain valid?)

ecryptfs lock_parent(): AFAICS, all callers could do ecryptfs_dentry_to_lower()
on dentry->d_parent instead of doing ecryptfs_dentry_to_lower(dentry) and
then doing dget_parent() - the covering layer dentries have ->d_parent
stabilized by ->i_mutex on said parents and we really, really don't want to
run into the case when tree topologies go out of sync.  I.e. we want to grab
->i_mutex on lower(parent), then check that it's equal to parent(lower) and,
if it's at all possible that some joker has played with rename(2) on underlying
layer mounted somewhere else, drop everything we'd taken and bugger off.

XFS xfs_filestream_get_parent() is clearly bogus -
        parent = dget_parent(dentry);
        if (!parent)
                goto out_dput;
When does dget_parent() return NULL?

And that's just from a quick grep.  The point is, dget_parent() is not a nice
API - historically it's been abused more often than not and we keep playing
whack-a-mole with it.  Sigh...

Speaking of bogosities, apparently nobody has noticed that automagical
turning BSD process accounting off upon r/o remounts of the filesystem
hosting the log got broken several years ago.  Suppose we find an opened
log when remounting.  OK, it gets closed, which means filp_close(...).
Which means that actual closing that sucker will happen just before we
return to userland.  Return with -EBUSY, since the filesystem still has
a file opened for write when we get around to checking that...

I have kinda-sorta fixes for that (basically, restoring the situation we
used to have before delayed fput() and being damn careful about avoiding
deadlocks), but I would really love to just rip that idiocy out.  IOW,
just let it fail with -EBUSY in such cases, same as for any other write-opened
file on that fs.

That, BTW, is the most painful part of the acct series, so being able to
drop it would be very nice.  Comments?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Is lockref_get_not_zero() really correct in dget_parent()
  2014-08-05  5:14   ` Al Viro
@ 2014-08-05 22:58     ` Eric W. Biederman
  0 siblings, 0 replies; 7+ messages in thread
From: Eric W. Biederman @ 2014-08-05 22:58 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, NeilBrown, linux-fsdevel, lkml

Al Viro <viro@ZenIV.linux.org.uk> writes:

> Speaking of bogosities, apparently nobody has noticed that automagical
> turning BSD process accounting off upon r/o remounts of the filesystem
> hosting the log got broken several years ago.  Suppose we find an opened
> log when remounting.  OK, it gets closed, which means filp_close(...).
> Which means that actual closing that sucker will happen just before we
> return to userland.  Return with -EBUSY, since the filesystem still has
> a file opened for write when we get around to checking that...
>
> I have kinda-sorta fixes for that (basically, restoring the situation we
> used to have before delayed fput() and being damn careful about avoiding
> deadlocks), but I would really love to just rip that idiocy out.  IOW,
> just let it fail with -EBUSY in such cases, same as for any other write-opened
> file on that fs.
>
> That, BTW, is the most painful part of the acct series, so being able to
> drop it would be very nice.  Comments?

Long ago in this discucssion someone mentioned that fedora and related
distro's don't have (or at least didn't have) code to close the
accounting files before unmount during shutdown.  I just checked the
debain acct package and it very clearly turns off accounting before
unmount (so debian should not care).

So the practical test is will returning -EBUSY cause fedora based
distro's it fail to unmount filesystems cleanly (looking at the code
and postualting bsd process accounting being left on) I think such
a change will cause such systems to fail to unmount their filesystems
on shutdown/reboot and cause problems because of that.

Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-08-05 22:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-05  3:17 Is lockref_get_not_zero() really correct in dget_parent() NeilBrown
2014-08-05  4:07 ` Linus Torvalds
2014-08-05  4:54   ` Steven Noonan
2014-08-05  4:56     ` Steven Noonan
2014-08-05 16:53       ` Linus Torvalds
2014-08-05  5:14   ` Al Viro
2014-08-05 22:58     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).