* fs/dcache: Resolve the last RT woes.
@ 2022-06-13 14:07 Sebastian Andrzej Siewior
2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner
PREEMPT_RT has two issues with the dcache code:
1) i_dir_seq is a special cased sequence counter which uses the lowest
bit as writer lock. This requires that preemption is disabled.
On !RT kernels preemption is implicitly disabled by spin_lock(), but
that's not the case on RT kernels.
Replacing i_dir_seq on RT with a seqlock_t comes with its own
problems due to arbitrary lock nesting. Using a seqcount with an
associated lock is not possible either because the locks held by the
callers are not necessarily related.
Explicitly disabling preemption on RT kernels across the i_dir_seq
write held critical section is the obvious and simplest solution. The
critical section is small, so latency is not really a concern.
2) The wake up of dentry::d_wait waiters is in a preemption disabled
section, which violates the RT constraints as wake_up_all() has
to acquire the wait queue head lock which is a 'sleeping' spinlock
on RT.
There are two reasons for the non-preemptible section:
A) The wake up happens under the hash list bit lock
B) The wake up happens inside the i_dir_seq write side
critical section
#A is solvable by moving it outside of the hash list bit lock held
section.
Making #B preemptible on RT is hard or even impossible due to lock
nesting constraints.
A possible solution would be to replace the waitqueue by a simple
waitqueue which can be woken up inside atomic sections on RT.
But aside from Linus not being a fan of simple waitqueues, there is
another observation about this wake up: the woken up waiter is likely
to immediately contend on dentry::lock.
It turns out that there is no reason to do the wake up within the
i_dir_seq write held region. The only requirement is to do the wake
up within the dentry::lock held region. Further details in the
individual patches.
That allows moving the wake up out of the non-preemptible section
on RT, which also reduces the dentry::lock held time after wake up.
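For reference, the i_dir_seq write side "lock" is its lowest bit, taken
via cmpxchg() as in start_dir_add() (quoted from patch 1):

        for (;;) {
                unsigned n = dir->i_dir_seq;
                if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
                        return n;
        }

An odd counter value means a write is in progress and readers have to
retry, which is why a preempted writer must never share a CPU with a
spinning reader.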
Thanks,
Sebastian
* [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
To: linux-fsdevel
Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
Sebastian Andrzej Siewior, Oleg.Karfich
i_dir_seq is a sequence counter whose lowest bit serves as a write lock.
The writer updates the counter atomically, which ensures that it can be
modified by only one writer at a time. This requires preemption to be
disabled across the write side critical section.
On !PREEMPT_RT kernels this is implicit because the caller acquires
dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable
preemption, which means that a preempting writer or reader would
live-lock. It is therefore required to disable preemption explicitly.
An alternative solution would be to replace i_dir_seq with a seqlock_t for
PREEMPT_RT, but that comes with its own set of problems due to arbitrary
lock nesting. A pure sequence count with an associated spinlock is not
possible because the locks held by the caller are not necessarily related.
As the critical section is small, disabling preemption is a sensible
solution.
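To illustrate the live lock, here is a rough sketch of the read side
(illustrative only; the real retry logic lives in d_alloc_parallel()):

        unsigned seq;

        /*
         * Wait for a concurrent dir add to finish. If the writer was
         * preempted with the lowest bit set, a reader spinning on the
         * same CPU prevents the writer from ever running again and
         * clearing the bit: a live lock.
         */
        do {
                seq = smp_load_acquire(&dir->i_dir_seq);
        } while (seq & 1);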
Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/dcache.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bfd..92aa72fce5e2e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2563,7 +2563,15 @@ EXPORT_SYMBOL(d_rehash);
static inline unsigned start_dir_add(struct inode *dir)
{
-
+ /*
+ * The caller holds a spinlock (dentry::d_lock). On !PREEMPT_RT
+ * kernels spin_lock() implicitly disables preemption, but not on
+ * PREEMPT_RT. So for RT it has to be done explicitly to protect
+ * the sequence count write side critical section against a reader
+ * or another writer preempting, which would result in a live lock.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
for (;;) {
unsigned n = dir->i_dir_seq;
if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
@@ -2575,6 +2583,8 @@ static inline unsigned start_dir_add(struct inode *dir)
static inline void end_dir_add(struct inode *dir, unsigned n)
{
smp_store_release(&dir->i_dir_seq, n + 2);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
}
static void d_wait_lookup(struct dentry *dentry)
--
2.36.1
* [PATCH 2/4] fs/dcache: Split __d_lookup_done()
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
2022-07-27 3:31 ` Al Viro
2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
` (3 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
To: linux-fsdevel
Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
Sebastian Andrzej Siewior
__d_lookup_done() wakes waiters on dentry::d_wait inside a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.
This is safe because:
1) The whole sequence including the wake up is protected by dentry::lock.
2) The waitqueue head is allocated by the caller on stack and can't go
away until the whole callchain completes.
3) If a queued waiter is woken by a spurious wake up, then it is blocked
on dentry::lock before it can observe DCACHE_PAR_LOOKUP cleared and
return from d_wait_lookup().
As the wake up is inside the dentry::lock held region it's guaranteed
that the waiter's waitq is dequeued from the waitqueue head before the
waiter returns.
Moving the wake up past the unlock of dentry::lock would allow the
waiter to return with the on stack waitq still enqueued due to a
spurious wake up.
4) New waiters have to acquire dentry::lock before checking whether the
DCACHE_PAR_LOOKUP flag is set.
Let __d_lookup_unhash():
1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
2) Unhash the dentry
3) Retrieve and clear dentry::d_wait
4) Unlock the hash and return the retrieved waitqueue head pointer
5) Let the caller handle the wake up.
This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.
An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside from Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.
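For reference, the waiter side in d_wait_lookup() illustrates points #3
and #4 above (as found in fs/dcache.c; unchanged by this series):

        static void d_wait_lookup(struct dentry *dentry)
        {
                if (d_in_lookup(dentry)) {
                        DECLARE_WAITQUEUE(wait, current);
                        add_wait_queue(dentry->d_wait, &wait);
                        do {
                                set_current_state(TASK_UNINTERRUPTIBLE);
                                spin_unlock(&dentry->d_lock);
                                schedule();
                                spin_lock(&dentry->d_lock);
                        } while (d_in_lookup(dentry));
                }
        }

The waiter holds dentry::lock both when enqueueing itself and when
re-checking DCACHE_PAR_LOOKUP, so a wake up anywhere inside the
dentry::lock held region cannot be missed.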
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/dcache.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 92aa72fce5e2e..fae4689a9a409 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2711,18 +2711,33 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
}
EXPORT_SYMBOL(d_alloc_parallel);
-void __d_lookup_done(struct dentry *dentry)
+/*
+ * - Unhash the dentry
+ * - Retrieve and clear the waitqueue head in dentry
+ * - Return the waitqueue head
+ */
+static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
{
- struct hlist_bl_head *b = in_lookup_hash(dentry->d_parent,
- dentry->d_name.hash);
+ wait_queue_head_t *d_wait;
+ struct hlist_bl_head *b;
+
+ lockdep_assert_held(&dentry->d_lock);
+
+ b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
hlist_bl_lock(b);
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
- wake_up_all(dentry->d_wait);
+ d_wait = dentry->d_wait;
dentry->d_wait = NULL;
hlist_bl_unlock(b);
INIT_HLIST_NODE(&dentry->d_u.d_alias);
INIT_LIST_HEAD(&dentry->d_lru);
+ return d_wait;
+}
+
+void __d_lookup_done(struct dentry *dentry)
+{
+ wake_up_all(__d_lookup_unhash(dentry));
}
EXPORT_SYMBOL(__d_lookup_done);
--
2.36.1
* [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
2022-06-13 14:07 ` [PATCH 1/4] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT Sebastian Andrzej Siewior
2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
2022-07-26 3:00 ` Al Viro
2022-06-13 14:07 ` [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region Sebastian Andrzej Siewior
` (2 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
To: linux-fsdevel
Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
Sebastian Andrzej Siewior
__d_add() and __d_move() invoke __d_lookup_done() from within a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
As a preparation for solving this completely, invoke __d_lookup_unhash()
from __d_add/move() and handle the wakeup there.
This allows moving the spin_lock/unlock(dentry::lock) pair into
__d_lookup_done(), which debloats the d_lookup_done() inline.
No functional change. Moving the wake up out of the preemption disabled
region on RT will be handled in a subsequent change.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/dcache.c | 6 ++++--
include/linux/dcache.h | 7 ++-----
2 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index fae4689a9a409..6ef1f5c32bc0f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2737,7 +2737,9 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
void __d_lookup_done(struct dentry *dentry)
{
+ spin_lock(&dentry->d_lock);
wake_up_all(__d_lookup_unhash(dentry));
+ spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(__d_lookup_done);
@@ -2751,7 +2753,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
- __d_lookup_done(dentry);
+ wake_up_all(__d_lookup_unhash(dentry));
}
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
@@ -2940,7 +2942,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
if (unlikely(d_in_lookup(target))) {
dir = target->d_parent->d_inode;
n = start_dir_add(dir);
- __d_lookup_done(target);
+ wake_up_all(__d_lookup_unhash(target));
}
write_seqcount_begin(&dentry->d_seq);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f5bba51480b2f..a07a51c858fb4 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -349,7 +349,7 @@ static inline void dont_mount(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}
-extern void __d_lookup_done(struct dentry *);
+extern void __d_lookup_done(struct dentry *dentry);
static inline int d_in_lookup(const struct dentry *dentry)
{
@@ -358,11 +358,8 @@ static inline int d_in_lookup(const struct dentry *dentry)
static inline void d_lookup_done(struct dentry *dentry)
{
- if (unlikely(d_in_lookup(dentry))) {
- spin_lock(&dentry->d_lock);
+ if (unlikely(d_in_lookup(dentry)))
__d_lookup_done(dentry);
- spin_unlock(&dentry->d_lock);
- }
}
extern void dput(struct dentry *);
--
2.36.1
* [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
@ 2022-06-13 14:07 ` Sebastian Andrzej Siewior
2022-07-08 15:52 ` fs/dcache: Resolve the last RT woes Thomas Gleixner
2022-07-25 15:49 ` Sebastian Andrzej Siewior
5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-13 14:07 UTC (permalink / raw)
To: linux-fsdevel
Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner,
Sebastian Andrzej Siewior
__d_add() and __d_move() wake up waiters on dentry::d_wait from within the
i_dir_seq write held region. This violates the PREEMPT_RT constraints as
the wake up acquires wait_queue_head::lock which is a "sleeping" spinlock
on RT.
There is no requirement to do so. __d_lookup_unhash() has cleared
DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
queue head pointer to the caller, so the actual wake up can be postponed
until the i_dir_seq write side critical section is left. The only
requirement is that dentry::lock is held across the whole sequence
including the wake up.
This is safe because:
1) The whole sequence including the wake up is protected by dentry::lock.
2) The waitqueue head is allocated by the caller on stack and can't go
away until the whole callchain completes.
3) If a queued waiter is woken by a spurious wake up, then it is blocked
on dentry::lock before it can observe DCACHE_PAR_LOOKUP cleared and
return from d_wait_lookup().
As the wake up is inside the dentry::lock held region it's guaranteed
that the waiter's waitq is dequeued from the waitqueue head before the
waiter returns.
Moving the wake up past the unlock of dentry::lock would allow the
waiter to return with the on stack waitq still enqueued due to a
spurious wake up.
4) New waiters have to acquire dentry::lock before checking whether the
DCACHE_PAR_LOOKUP flag is set.
Move the wake up past end_dir_add(), which leaves the i_dir_seq write side
critical section and enables preemption.
For non-RT kernels there is no difference because preemption is still
disabled due to dentry::lock being held, but it shortens the time between
wake up and unlocking dentry::lock, which reduces the contention for the
woken up waiter.
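Condensed from the diff below, the resulting ordering in __d_add() is:

        spin_lock(&dentry->d_lock);
        if (unlikely(d_in_lookup(dentry))) {
                dir = dentry->d_parent->d_inode;
                n = start_dir_add(dir);         /* RT: preempt_disable() */
                d_wait = __d_lookup_unhash(dentry);
        }
        /* ... rehash etc. ... */
        if (dir) {
                end_dir_add(dir, n);            /* RT: preempt_enable() */
                wake_up_all(d_wait);            /* now preemptible on RT */
        }
        spin_unlock(&dentry->d_lock);

__d_move() follows the same pattern.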
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/dcache.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 6ef1f5c32bc0f..0b5fd3a17ff7c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2747,13 +2747,15 @@ EXPORT_SYMBOL(__d_lookup_done);
static inline void __d_add(struct dentry *dentry, struct inode *inode)
{
+ wait_queue_head_t *d_wait;
struct inode *dir = NULL;
unsigned n;
+
spin_lock(&dentry->d_lock);
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
- wake_up_all(__d_lookup_unhash(dentry));
+ d_wait = __d_lookup_unhash(dentry);
}
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
@@ -2764,8 +2766,10 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
fsnotify_update_flags(dentry);
}
__d_rehash(dentry);
- if (dir)
+ if (dir) {
end_dir_add(dir, n);
+ wake_up_all(d_wait);
+ }
spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
@@ -2912,6 +2916,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
bool exchange)
{
struct dentry *old_parent, *p;
+ wait_queue_head_t *d_wait;
struct inode *dir = NULL;
unsigned n;
@@ -2942,7 +2947,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
if (unlikely(d_in_lookup(target))) {
dir = target->d_parent->d_inode;
n = start_dir_add(dir);
- wake_up_all(__d_lookup_unhash(target));
+ d_wait = __d_lookup_unhash(target);
}
write_seqcount_begin(&dentry->d_seq);
@@ -2977,8 +2982,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
write_seqcount_end(&target->d_seq);
write_seqcount_end(&dentry->d_seq);
- if (dir)
+ if (dir) {
end_dir_add(dir, n);
+ wake_up_all(d_wait);
+ }
if (dentry->d_parent != old_parent)
spin_unlock(&dentry->d_parent->d_lock);
--
2.36.1
* Re: fs/dcache: Resolve the last RT woes.
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2022-06-13 14:07 ` [PATCH 4/4] fs/dcache: Move wakeup out of i_dir_seq write held region Sebastian Andrzej Siewior
@ 2022-07-08 15:52 ` Thomas Gleixner
2022-07-25 15:49 ` Sebastian Andrzej Siewior
5 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2022-07-08 15:52 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox
On Mon, Jun 13 2022 at 16:07, Sebastian Andrzej Siewior wrote:
Polite ping ....
> PREEMPT_RT has two issues with the dcache code:
>
> [ ... full cover letter quoted ... ]
* Re: fs/dcache: Resolve the last RT woes.
2022-06-13 14:07 fs/dcache: Resolve the last RT woes Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2022-07-08 15:52 ` fs/dcache: Resolve the last RT woes Thomas Gleixner
@ 2022-07-25 15:49 ` Sebastian Andrzej Siewior
5 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-25 15:49 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Alexander Viro, Matthew Wilcox, Thomas Gleixner
On 2022-06-13 16:07:08 [+0200], To linux-fsdevel@vger.kernel.org wrote:
> PREEMPT_RT has two issues with the dcache code:
A polite ping for the series. This is one of two roadblocks to getting RT
enabled in v5.20. I don't want to add any pressure, just point out that I
can't sit still for days since the end is near ;)
Sebastian
* Re: [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
2022-06-13 14:07 ` [PATCH 3/4] fs/dcache: Use __d_lookup_unhash() in __d_add/move() Sebastian Andrzej Siewior
@ 2022-07-26 3:00 ` Al Viro
2022-07-26 7:47 ` [PATCH 3/4 v2] " Sebastian Andrzej Siewior
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2022-07-26 3:00 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner
On Mon, Jun 13, 2022 at 04:07:11PM +0200, Sebastian Andrzej Siewior wrote:
> __d_add() and __d_move() invoke __d_lookup_done() from within a preemption
> disabled region. This violates the PREEMPT_RT constraints as the wake up
> acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
>
> As a preparation for solving this completely, invoke __d_lookup_unhash()
> from __d_add/move() and handle the wakeup there.
>
> This allows to move the spin_lock/unlock(dentry::lock) pair into
> __d_lookup_done() which debloats the d_lookup_done() inline.
It also changes calling conventions for a helper that is, unfortunately,
exported. Sure, nobody outside of fs/dcache.c and d_lookup_done() is
supposed to be using it. But... export is export.
Rename that sucker, please - at least that way whoever's playing with
it will get a build breakage.
* [PATCH 3/4 v2] fs/dcache: Use __d_lookup_unhash() in __d_add/move()
2022-07-26 3:00 ` Al Viro
@ 2022-07-26 7:47 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-26 7:47 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner
__d_add() and __d_move() invoke __d_lookup_done() from within a preemption
disabled region. This violates the PREEMPT_RT constraints as the wake up
acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
As a preparation for solving this completely, invoke __d_lookup_unhash()
from __d_add/move() and handle the wakeup there.
This allows moving the spin_lock/unlock(dentry::lock) pair into
__d_lookup_done(), which debloats the d_lookup_done() inline. Rename
__d_lookup_done() -> __d_lookup_wake() to force build failures for
out-of-tree code which used __d_lookup_done() and did not adapt.
No functional change. Moving the wake up out of the preemption disabled
region on RT will be handled in a subsequent change.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2:
- Rename __d_lookup_done() -> __d_lookup_wake().
fs/dcache.c | 10 ++++++----
include/linux/dcache.h | 9 +++------
2 files changed, 9 insertions(+), 10 deletions(-)
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2735,11 +2735,13 @@ static wait_queue_head_t *__d_lookup_unh
return d_wait;
}
-void __d_lookup_done(struct dentry *dentry)
+void __d_lookup_wake(struct dentry *dentry)
{
+ spin_lock(&dentry->d_lock);
wake_up_all(__d_lookup_unhash(dentry));
+ spin_unlock(&dentry->d_lock);
}
-EXPORT_SYMBOL(__d_lookup_done);
+EXPORT_SYMBOL(__d_lookup_wake);
/* inode->i_lock held if inode is non-NULL */
@@ -2751,7 +2753,7 @@ static inline void __d_add(struct dentry
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
- __d_lookup_done(dentry);
+ wake_up_all(__d_lookup_unhash(dentry));
}
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
@@ -2940,7 +2942,7 @@ static void __d_move(struct dentry *dent
if (unlikely(d_in_lookup(target))) {
dir = target->d_parent->d_inode;
n = start_dir_add(dir);
- __d_lookup_done(target);
+ wake_up_all(__d_lookup_unhash(target));
}
write_seqcount_begin(&dentry->d_seq);
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -349,7 +349,7 @@ static inline void dont_mount(struct den
spin_unlock(&dentry->d_lock);
}
-extern void __d_lookup_done(struct dentry *);
+extern void __d_lookup_wake(struct dentry *dentry);
static inline int d_in_lookup(const struct dentry *dentry)
{
@@ -358,11 +358,8 @@ static inline int d_in_lookup(const stru
static inline void d_lookup_done(struct dentry *dentry)
{
- if (unlikely(d_in_lookup(dentry))) {
- spin_lock(&dentry->d_lock);
- __d_lookup_done(dentry);
- spin_unlock(&dentry->d_lock);
- }
+ if (unlikely(d_in_lookup(dentry)))
+ __d_lookup_wake(dentry);
}
extern void dput(struct dentry *);
* Re: [PATCH 2/4] fs/dcache: Split __d_lookup_done()
2022-06-13 14:07 ` [PATCH 2/4] fs/dcache: Split __d_lookup_done() Sebastian Andrzej Siewior
@ 2022-07-27 3:31 ` Al Viro
2022-07-27 3:36 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2022-07-27 3:31 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner
On Mon, Jun 13, 2022 at 04:07:10PM +0200, Sebastian Andrzej Siewior wrote:
> __d_lookup_done() wakes waiters on dentry::d_wait inside a preemption
> disabled region. This violates the PREEMPT_RT constraints as the wake up
> acquires wait_queue_head::lock which is a "sleeping" spinlock on RT.
I'd probably turn that into something like
__d_lookup_done() wakes waiters on dentry->d_wait. On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquires wait_queue_head::lock, which is a "sleeping" spinlock on RT.
Calling it under dentry->d_lock is not a problem, since that is also
a "sleeping" spinlock on the same configs. Unfortunately, two of
its callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.
The key observation is that wakeup can be moved to any point before
dropping ->d_lock.
> As a first step to solve this, move the wake up outside of the
> hlist_bl_lock() held section.
>
> This is safe because:
>
> 1) The whole sequence including the wake up is protected by dentry::lock.
>
> 2) The waitqueue head is allocated by the caller on stack and can't go
> away until the whole callchain completes.
That's too vague and in one case simply incorrect - the call
of d_alloc_parallel() in nfs_call_unlink() does *not* have wq in the stack
frame of anything in the callchain. Incidentally, another unusual caller
(d_add_ci()) has a bug (see below). What really matters is that we can't
reach destruction of wq without __d_lookup_done() under ->d_lock.
Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags. As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and the moment ->d_lock is dropped, we are
safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).
->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry. Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.
With two exceptions, wq lives in the call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.
One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata. It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.
Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of the d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry. Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.
[[[
Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which *NORMALLY* feeds it to __d_add() or
__d_move() in a way that will have __d_lookup_done() applied to it.
However, there is a nasty possibility - d_splice_alias() might
legitimately fail without having marked the sucker not in-lookup. dentry
will get dropped by d_add_ci(), so ->d_wait won't end up pointing to freed
object, but it's still a bug - retain_dentry() will scream bloody murder
upon seeing that, and for a good reason; we'll get hash chain corrupted.
It's impossible to hit without corrupted fs image (ntfs or case-insensitive
xfs), but it's a bug. Fix is a one-liner (add d_lookup_done(found);
right after
res = d_splice_alias(inode, found);
if (res) {
in d_add_ci()) and with that done the last sentence about d_add_ci() turns
into
]]]
Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock. That concludes the analysis.
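Sketched out, the suggested one-liner in d_add_ci() would look like this
(surrounding lines assumed from current mainline):

        res = d_splice_alias(inode, found);
        if (res) {
                d_lookup_done(found);   /* proposed fix: end the in-lookup state */
                dput(found);
                return res;
        }
        return found;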
PS: I'm not sure we need to do this migration of wakeup in stages;
lift it into the caller of __d_lookup_done() as the first step,
then move the damn thing all the way to end_dir_add(). Analysis
can go into either...
* Re: [PATCH 2/4] fs/dcache: Split __d_lookup_done()
2022-07-27 3:31 ` Al Viro
@ 2022-07-27 3:36 ` Al Viro
0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2022-07-27 3:36 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: linux-fsdevel, Matthew Wilcox, Thomas Gleixner
On Wed, Jul 27, 2022 at 04:31:21AM +0100, Al Viro wrote:
> [ ... full analysis quoted ... ]
PPS: I'm OK with the series in its current form; the resplit mentioned above
is a matter of taste, and if you prefer to keep the commit boundaries
as-is, I'm fine with that. Commit message really needs to be changed,
though. Would you be OK with the changes as above?