* fs/dcache: Resolve the last RT woes.
@ 2022-06-13 14:07 Sebastian Andrzej Siewior
  2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner

PREEMPT_RT has two issues with the dcache code:

   1) i_dir_seq is a special-cased sequence counter which uses the lowest
      bit as a writer lock (see the sketch after this list). This requires
      that preemption is disabled.

      On !RT kernels preemption is implicitly disabled by spin_lock(), but
      that's not the case on RT kernels.

      Replacing i_dir_seq on RT with a seqlock_t comes with its own
      problems due to arbitrary lock nesting. Using a seqcount with an
      associated lock is not possible either because the locks held by the
      callers are not necessarily related.

      Explicitly disabling preemption on RT kernels across the i_dir_seq
      write held critical section is the obvious and simplest solution. The
      critical section is small, so latency is not really a concern.

   2) The wake up of dentry::d_wait waiters is in a preemption disabled
      section, which violates the RT constraints as wake_up_all() has
      to acquire the wait queue head lock, which is a 'sleeping' spinlock
      on RT.

      There are two reasons for the non-preemptible section:

         A) The wake up happens under the hash list bit lock

         B) The wake up happens inside the i_dir_seq write side
            critical section

        #A is solvable by moving it outside of the hash list bit lock held
        section.

       Making #B preemptible on RT is hard or even impossible due to lock
       nesting constraints.

        A possible solution would be to replace the waitqueue with a simple
        waitqueue, which can be woken up inside atomic sections on RT.

        But aside from Linus not being a fan of simple waitqueues, there is
        another observation about this wake up: the woken waiter is likely
        to immediately contend on dentry::lock.

       It turns out that there is no reason to do the wake up within the
       i_dir_seq write held region. The only requirement is to do the wake
       up within the dentry::lock held region. Further details in the
       individual patches.

        That allows moving the wake up out of the non-preemptible section
        on RT, which also reduces the dentry::lock held time after wake up.
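
To make the locking scheme in #1 concrete, the write side protocol boils
down to the following (condensed from the pre-patch code shown in patch 1;
a simplified sketch, not the verbatim kernel source):

static inline unsigned start_dir_add(struct inode *dir)
{
	for (;;) {
		unsigned n = dir->i_dir_seq;

		/* Acquire: move the counter from even (unlocked) to odd. */
		if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
			return n;
	}
}

static inline void end_dir_add(struct inode *dir, unsigned n)
{
	/* Release: advance to the next even value. */
	smp_store_release(&dir->i_dir_seq, n + 2);
}

Readers sample the counter and retry while the lowest bit is set, which is
where the preemption requirement in #1 comes from.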

Thanks,

Sebastian





* [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
  2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
	Sebastian Andrzej Siewior, Oleg.Karfich

i_dir_seq is a sequence counter whose lowest bit serves as the writer
lock. The writer atomically updates the counter, which ensures that it can
be modified by only one writer at a time. This requires preemption to be
disabled across the write side critical section.

On !PREEMPT_RT kernels this is implicit in the caller acquiring
dentry::lock, as spin_lock() disables preemption. On PREEMPT_RT kernels
spin_lock() does not disable preemption, which means that a reader or
another writer preempting the write side would live lock. It's therefore
required to disable preemption explicitly.
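
To illustrate the live lock, the reader side follows this pattern
(modelled on d_alloc_parallel(); a condensed sketch with a hypothetical
helper name, not the verbatim kernel code):

/* Sample i_dir_seq for a lockless lookup; retry while a writer owns it. */
static unsigned read_dir_seq(struct inode *dir)
{
	for (;;) {
		unsigned seq = smp_load_acquire(&dir->i_dir_seq);

		if (!(seq & 1))
			return seq;
		/*
		 * Odd means a writer holds the lock bit. If that writer was
		 * preempted by a higher priority task spinning here on the
		 * same CPU, the bit is never cleared: live lock. Hence the
		 * write side must be non-preemptible.
		 */
	}
}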

An alternative solution would be to replace i_dir_seq with a seqlock_t for
PREEMPT_RT, but that comes with its own set of problems due to arbitrary
lock nesting. A pure sequence count with an associated spinlock is not
possible because the locks held by the caller are not necessarily related.

As the critical section is small, disabling preemption is a sensible
solution.

Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/dcache.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bfd..92aa72fce5e2e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2563,7 +2563,15 @@ EXPORT_SYMBOL(d_rehash);
 
 static inline unsigned start_dir_add(struct inode *dir)
 {
-
+	/*
+	 * The caller holds a spinlock (dentry::d_lock). On !PREEMPT_RT
+	 * kernels spin_lock() implicitly disables preemption, but not on
+	 * PREEMPT_RT.  So for RT it has to be done explicitly to protect
+	 * the sequence count write side critical section against a reader
+	 * or another writer preempting, which would result in a live lock.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 	for (;;) {
 		unsigned n = dir->i_dir_seq;
 		if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
@@ -2575,6 +2583,8 @@ static inline unsigned start_dir_add(struct inode *dir)
 static inline void end_dir_add(struct inode *dir, unsigned n)
 {
 	smp_store_release(&dir->i_dir_seq, n + 2);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static void d_wait_lookup(struct dentry *dentry)
-- 
2.36.1



* [PATCH 2/4] fs/dcache: Split __d_lookup_done()
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
  2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
  2022-07-27  3:31   ` Al Viro
  2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
	Sebastian Andrzej Siewior

__d_lookup_done() wakes waiters on dentry::d_wait inside a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.

As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.

This is safe because:

  1) The whole sequence including the wake up is protected by dentry::lock.

  2) The waitqueue head is allocated by the caller on stack and can't go
     away until the whole callchain completes.

  3) If a queued waiter is woken by a spurious wake up, then it is blocked
     on dentry::lock before it can observe DCACHE_PAR_LOOKUP cleared and
     return from d_wait_lookup() (see the sketch after this list).

     As the wake up is inside the dentry::lock held region it's guaranteed
     that the waiter's waitq is dequeued from the waitqueue head before the
     waiter returns.

     Moving the wake up past the unlock of dentry::lock would allow the
     waiter to return with the on-stack waitq still enqueued due to a
     spurious wake up.

  4) New waiters have to acquire dentry::lock before checking whether the
     DCACHE_PAR_LOOKUP flag is set.
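
For reference, the waiter side follows this pattern (a condensed sketch
of d_wait_lookup(); see fs/dcache.c for the real thing):

static void d_wait_lookup(struct dentry *dentry)
{
	if (d_in_lookup(dentry)) {
		DECLARE_WAITQUEUE(wait, current);

		/* Enqueue under dentry::lock, after DCACHE_PAR_LOOKUP was seen. */
		add_wait_queue(dentry->d_wait, &wait);
		do {
			set_current_state(TASK_UNINTERRUPTIBLE);
			/* Sleep with dentry::lock dropped ... */
			spin_unlock(&dentry->d_lock);
			schedule();
			/* ... and retake it before re-checking the flag. */
			spin_lock(&dentry->d_lock);
		} while (d_in_lookup(dentry));
	}
}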

Let __d_lookup_unhash():

  1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
  2) Unhash the dentry
  3) Retrieve and clear dentry::d_wait
  4) Unlock the hash and return the retrieved waitqueue head pointer
  5) Let the caller handle the wake up.

This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.

An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside from Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/dcache.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 92aa72fce5e2e..fae4689a9a409 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2711,18 +2711,33 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
 }
 EXPORT_SYMBOL(d_alloc_parallel);
 
-void __d_lookup_done(struct dentry *dentry)
+/*
+ * - Unhash the dentry
+ * - Retrieve and clear the waitqueue head in dentry
+ * - Return the waitqueue head
+ */
+static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
 {
-	struct hlist_bl_head *b = in_lookup_hash(dentry->d_parent,
-						 dentry->d_name.hash);
+	wait_queue_head_t *d_wait;
+	struct hlist_bl_head *b;
+
+	lockdep_assert_held(&dentry->d_lock);
+
+	b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
 	hlist_bl_lock(b);
 	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
 	__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
-	wake_up_all(dentry->d_wait);
+	d_wait = dentry->d_wait;
 	dentry->d_wait = NULL;
 	hlist_bl_unlock(b);
 	INIT_HLIST_NODE(&dentry->d_u.d_alias);
 	INIT_LIST_HEAD(&dentry->d_lru);
+	return d_wait;
+}
+
+void __d_lookup_done(struct dentry *dentry)
+{
+	wake_up_all(__d_lookup_unhash(dentry));
 }
 EXPORT_SYMBOL(__d_lookup_done);
 
-- 
2.36.1



* [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
  2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
  2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
  2022-07-26  3:00   ` Al Viro
  2022-06-13 14:07 ` [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
	Sebastian Andrzej Siewior

__d_add() and __d_move() invoke __d_lookup_done() from within a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.

As a preparation for solving this completely, invoke __d_lookup_unhash()
from __d_add/move() and handle the wakeup there.

This allows moving the spin_lock/unlock(dentry::lock) pair into
__d_lookup_done(), which debloats the d_lookup_done() inline.

No functional change. Moving the wake up out of the preemption disabled
region on RT will be handled in a subsequent change.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/dcache.c            | 6 ++++--
 include/linux/dcache.h | 7 ++-----
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fae4689a9a409..6ef1f5c32bc0f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2737,7 +2737,9 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
 
 void __d_lookup_done(struct dentry *dentry)
 {
+	spin_lock(&dentry->d_lock);
 	wake_up_all(__d_lookup_unhash(dentry));
+	spin_unlock(&dentry->d_lock);
 }
 EXPORT_SYMBOL(__d_lookup_done);
 
@@ -2751,7 +2753,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 	if (unlikely(d_in_lookup(dentry))) {
 		dir = dentry->d_parent->d_inode;
 		n = start_dir_add(dir);
-		__d_lookup_done(dentry);
+		wake_up_all(__d_lookup_unhash(dentry));
 	}
 	if (inode) {
 		unsigned add_flags = d_flags_for_inode(inode);
@@ -2940,7 +2942,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	if (unlikely(d_in_lookup(target))) {
 		dir = target->d_parent->d_inode;
 		n = start_dir_add(dir);
-		__d_lookup_done(target);
+		wake_up_all(__d_lookup_unhash(target));
 	}
 
 	write_seqcount_begin(&dentry->d_seq);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f5bba51480b2f..a07a51c858fb4 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -349,7 +349,7 @@ static inline void dont_mount(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
-extern void __d_lookup_done(struct dentry *);
+extern void __d_lookup_done(struct dentry *dentry);
 
 static inline int d_in_lookup(const struct dentry *dentry)
 {
@@ -358,11 +358,8 @@ static inline int d_in_lookup(const struct dentry *dentry)
 
 static inline void d_lookup_done(struct dentry *dentry)
 {
-	if (unlikely(d_in_lookup(dentry))) {
-		spin_lock(&dentry->d_lock);
+	if (unlikely(d_in_lookup(dentry)))
 		__d_lookup_done(dentry);
-		spin_unlock(&dentry->d_lock);
-	}
 }
 
 extern void dput(struct dentry *);
-- 
2.36.1



* [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
  2022-07-08 15:52 ` fs/dcache: Resolve the last RT woes Thomas Gleixner
  2022-07-25 15:49 ` Sebastian Andrzej Siewior
  5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
	Sebastian Andrzej Siewior

__d_add() and __d_move() wake up waiters on dentry::d_wait from within the
i_dir_seq write held region.  This violates the PREEMPT_RT constraints as
the wake up acquires wait_queue_head::lock which is a "sleeping" spinlock
on RT.

There is no requirement to do so. __d_lookup_unhash() has cleared
DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
queue head pointer to the caller, so the actual wake up can be postponed
until the i_dir_seq write side critical section is left. The only
requirement is that dentry::lock is held across the whole sequence
including the wake up.

This is safe because:

  1) The whole sequence including the wake up is protected by dentry::lock.

  2) The waitqueue head is allocated by the caller on stack and can't go
     away until the whole callchain completes.

  3) If a queued waiter is woken by a spurious wake up, then it is blocked
     on dentry::lock before it can observe DCACHE_PAR_LOOKUP cleared and
     return from d_wait_lookup().

     As the wake up is inside the dentry::lock held region it's guaranteed
     that the waiter's waitq is dequeued from the waitqueue head before the
     waiter returns.

     Moving the wake up past the unlock of dentry::lock would allow the
     waiter to return with the on-stack waitq still enqueued due to a
     spurious wake up.

  4) New waiters have to acquire dentry::lock before checking whether the
     DCACHE_PAR_LOOKUP flag is set.

Move the wake up past end_dir_add(), which leaves the i_dir_seq write side
critical section and enables preemption.

For non-RT kernels there is no difference because preemption is still
disabled due to dentry::lock being held, but it shortens the time between
wake up and unlocking dentry::lock, which reduces the contention for the
woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/dcache.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6ef1f5c32bc0f..0b5fd3a17ff7c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2747,13 +2747,15 @@ EXPORT_SYMBOL(__d_lookup_done);
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
+	wait_queue_head_t *d_wait;
 	struct inode *dir = NULL;
 	unsigned n;
+
 	spin_lock(&dentry->d_lock);
 	if (unlikely(d_in_lookup(dentry))) {
 		dir = dentry->d_parent->d_inode;
 		n = start_dir_add(dir);
-		wake_up_all(__d_lookup_unhash(dentry));
+		d_wait = __d_lookup_unhash(dentry);
 	}
 	if (inode) {
 		unsigned add_flags = d_flags_for_inode(inode);
@@ -2764,8 +2766,10 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 		fsnotify_update_flags(dentry);
 	}
 	__d_rehash(dentry);
-	if (dir)
+	if (dir) {
 		end_dir_add(dir, n);
+		wake_up_all(d_wait);
+	}
 	spin_unlock(&dentry->d_lock);
 	if (inode)
 		spin_unlock(&inode->i_lock);
@@ -2912,6 +2916,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 		     bool exchange)
 {
 	struct dentry *old_parent, *p;
+	wait_queue_head_t *d_wait;
 	struct inode *dir = NULL;
 	unsigned n;
 
@@ -2942,7 +2947,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	if (unlikely(d_in_lookup(target))) {
 		dir = target->d_parent->d_inode;
 		n = start_dir_add(dir);
-		wake_up_all(__d_lookup_unhash(target));
+		d_wait = __d_lookup_unhash(target);
 	}
 
 	write_seqcount_begin(&dentry->d_seq);
@@ -2977,8 +2982,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_end(&target->d_seq);
 	write_seqcount_end(&dentry->d_seq);
 
-	if (dir)
+	if (dir) {
 		end_dir_add(dir, n);
+		wake_up_all(d_wait);
+	}
 
 	if (dentry->d_parent != old_parent)
 		spin_unlock(&dentry->d_parent->d_lock);
-- 
2.36.1



* Re: fs/dcache: Resolve the last RT woes.
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2022-06-13 14:07 ` [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region Sebastian Andrzej Siewior
@ 2022-07-08 15:52 ` Thomas Gleixner
  2022-07-25 15:49 ` Sebastian Andrzej Siewior
  5 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2022-07-08 15:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox

On Mon, Jun 13 2022 at 16:07, Sebastian Andrzej Siewior wrote:

Polite ping ....


> PREEMPT_RT has two issues with the dcache code:
> [... rest of the cover letter quoted in full, trimmed ...]


* Re: fs/dcache: Resolve the last RT woes.
  2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2022-07-08 15:52 ` fs/dcache: Resolve the last RT woes Thomas Gleixner
@ 2022-07-25 15:49 ` Sebastian Andrzej Siewior
  5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-25 15:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner

On 2022-06-13 16:07:08 [+0200], To linux-fsdevel@vger.kernel.org wrote:
> PREEMPT_RT has two issues with the dcache code:

a polite ping for the series. This is one of two roadblocks to get RT
enabled in v5.20. I don't want to add any pressure, just point out that I
can't sit still for days since the end is near ;)

Sebastian


* Re: [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
  2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
@ 2022-07-26  3:00   ` Al Viro
  2022-07-26  7:47     ` [PATCH 3/4 v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2022-07-26  3:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner

On Mon, Jun 13, 2022 at 04:07:11PM +0200, Sebastian Andrzej Siewior wrote:
> __d_add() and __d_move() invoke __d_lookup_done() from within a preemption
> disabled region. This violates the PREEMPT_RT constraints as the wake up
> acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
> 
> As a preparation for solving this completely, invoke __d_lookup_unhash()
> from __d_add/move() and handle the wakeup there.
> 
> This allows to move the spin_lock/unlock(dentry::lock) pair into
> __d_lookup_done() which debloats the d_lookup_done() inline.

It also changes calling conventions for a helper that is, unfortunately,
exported.  Sure, nobody outside of fs/dcache.c and d_lookup_done() is
supposed to be using it.  But... export is export.

Rename that sucker, please - at least that way whoever's playing with
it will get a build breakage.


* [PATCH 3/4 v2] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
  2022-07-26  3:00   ` Al Viro
@ 2022-07-26  7:47     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-26  7:47 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner

__d_add() and __d_move() invoke __d_lookup_done() from within a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.

As a preparation for solving this completely, invoke __d_lookup_unhash()
from __d_add/move() and handle the wakeup there.

This allows moving the spin_lock/unlock(dentry::lock) pair into
__d_lookup_done(), which debloats the d_lookup_done() inline. Rename
__d_lookup_done() -> __d_lookup_wake() to force build failures for OOT code
which used __d_lookup_done() and did not adapt.

No functional change. Moving the wake up out of the preemption disabled
region on RT will be handled in a subsequent change.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2:
   - Rename __d_lookup_done() -> __d_lookup_wake().

 fs/dcache.c            |   10 ++++++----
 include/linux/dcache.h |    9 +++------
 2 files changed, 9 insertions(+), 10 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2735,11 +2735,13 @@ static wait_queue_head_t *__d_lookup_unh
 	return d_wait;
 }
 
-void __d_lookup_done(struct dentry *dentry)
+void __d_lookup_wake(struct dentry *dentry)
 {
+	spin_lock(&dentry->d_lock);
 	wake_up_all(__d_lookup_unhash(dentry));
+	spin_unlock(&dentry->d_lock);
 }
-EXPORT_SYMBOL(__d_lookup_done);
+EXPORT_SYMBOL(__d_lookup_wake);
 
 /* inode->i_lock held if inode is non-NULL */
 
@@ -2751,7 +2753,7 @@ static inline void __d_add(struct dentry
 	if (unlikely(d_in_lookup(dentry))) {
 		dir = dentry->d_parent->d_inode;
 		n = start_dir_add(dir);
-		__d_lookup_done(dentry);
+		wake_up_all(__d_lookup_unhash(dentry));
 	}
 	if (inode) {
 		unsigned add_flags = d_flags_for_inode(inode);
@@ -2940,7 +2942,7 @@ static void __d_move(struct dentry *dent
 	if (unlikely(d_in_lookup(target))) {
 		dir = target->d_parent->d_inode;
 		n = start_dir_add(dir);
-		__d_lookup_done(target);
+		wake_up_all(__d_lookup_unhash(target));
 	}
 
 	write_seqcount_begin(&dentry->d_seq);
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -349,7 +349,7 @@ static inline void dont_mount(struct den
 	spin_unlock(&dentry->d_lock);
 }
 
-extern void __d_lookup_done(struct dentry *);
+extern void __d_lookup_wake(struct dentry *dentry);
 
 static inline int d_in_lookup(const struct dentry *dentry)
 {
@@ -358,11 +358,8 @@ static inline int d_in_lookup(const stru
 
 static inline void d_lookup_done(struct dentry *dentry)
 {
-	if (unlikely(d_in_lookup(dentry))) {
-		spin_lock(&dentry->d_lock);
-		__d_lookup_done(dentry);
-		spin_unlock(&dentry->d_lock);
-	}
+	if (unlikely(d_in_lookup(dentry)))
+		__d_lookup_wake(dentry);
 }
 
 extern void dput(struct dentry *);


* Re: [PATCH 2/4] fs/dcache: Split __d_lookup_done()
  2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
@ 2022-07-27  3:31   ` Al Viro
  2022-07-27  3:36     ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2022-07-27  3:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner

On Mon, Jun 13, 2022 at 04:07:10PM +0200, Sebastian Andrzej Siewior wrote:
> __d_lookup_done() wakes waiters on dentry::d_wait inside a preemption
> disabled region. This violates the PREEMPT_RT constraints as the wake up
> acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.

I'd probably turn that into something like

__d_lookup_done() wakes waiters on dentry->d_wait.  On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquires wait_queue_head::lock, which is a "sleeping" spinlock on RT.

Calling it under dentry->d_lock is not a problem, since that is also
a "sleeping" spinlock on the same configs.  Unfortunately, two of
its callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.

The key observation is that wakeup can be moved to any point before
dropping ->d_lock.

> As a first step to solve this, move the wake up outside of the
> hlist_bl_lock() held section.
> 
> This is safe because:
> 
>   1) The whole sequence including the wake up is protected by dentry::lock.
> 
>   2) The waitqueue head is allocated by the caller on stack and can't go
>      away until the whole callchain completes.

	That's too vague and in one case simply incorrect - the call
of d_alloc_parallel() in nfs_call_unlink() does *not* have wq in stack
frame of anything in the callchain.  Incidentally, another unusual caller
(d_add_ci()) has a bug (see below).  What really matters is that we can't
reach destruction of wq without __d_lookup_done() under ->d_lock.

Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags.  As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).

->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry.  Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.

With two exceptions wq lives in the call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame (the common
shape is sketched below).
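
The common shape of those call frames, modelled on __lookup_slow() in
fs/namei.c (a condensed sketch, error handling omitted):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct dentry *dentry;

	dentry = d_alloc_parallel(dir, name, &wq);
	/* ... perform the actual lookup while others may wait on wq ... */
	d_lookup_done(dentry);	/* must precede wq going out of scope */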

One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata.  It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.

Remaining exception is d_add_ci().  There wq is what we'd found in
->d_wait of d_add_ci() argument.  Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry.  Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.

[[[
Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which *NORMALLY* feeds it to __d_add() or
__d_move() in a way that will have __d_lookup_done() applied to it.

	However, there is a nasty possibility - d_splice_alias() might
legitimately fail without having marked the sucker not in-lookup.  dentry
will get dropped by d_add_ci(), so ->d_wait won't end up pointing to freed
object, but it's still a bug - retain_dentry() will scream bloody murder
upon seeing that, and for a good reason; we'll get hash chain corrupted.
It's impossible to hit without corrupted fs image (ntfs or case-insensitive
xfs), but it's a bug.  Fix is a one-liner (add d_lookup_done(found);
right after
        res = d_splice_alias(inode, found);
	if (res) {
in d_add_ci()) and with that done the last sentence about d_add_ci() turns
into
]]]

Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock.  That concludes the analysis.
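
Sketched concretely, the one-liner would land like this (condensed from
d_add_ci(); illustrative placement, not a tested patch):

	res = d_splice_alias(inode, found);
	if (res) {
		d_lookup_done(found);	/* take it out of in-lookup state */
		dput(found);
		return res;
	}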


PS: I'm not sure we need to do this migration of wakeup in stages;
lift it into the caller of __d_lookup_done() as the first step,
then move the damn thing all the way to end_dir_add().  Analysis
can go into either...


* Re: [PATCH 2/4] fs/dcache: Split __d_lookup_done()
  2022-07-27  3:31   ` Al Viro
@ 2022-07-27  3:36     ` Al Viro
  0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2022-07-27  3:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner

On Wed, Jul 27, 2022 at 04:31:21AM +0100, Al Viro wrote:
> [... full quote of the parent message trimmed ...]
>
> PS: I'm not sure we need to do this migration of wakeup in stages;
> lift it into the caller of __d_lookup_done() as the first step,
> then move the damn thing all the way to end_dir_add().  Analysis
> can go into either...

PPS: I'm OK with the series in its current form; resplit mentioned above
is a matter of taste and if you prefer to keep the commit boundaries
as-is, I'm fine with that.  Commit message really needs to be changed,
though.  Would you be OK with the changes as above?

