* [PATCH RFC v2 0/6] inode: turn i_state into u32
@ 2024-08-21 15:47 Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
` (7 more replies)
0 siblings, 8 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
Hey,
So the first time around I managed to send out my personal wip branch by
mistake. So for the record: the intention is to send what's in work.i_state on
vfs/vfs.git.
---
I've recently looked for some free space in struct inode again because
of some exec kerfuffle we recently had. While my idea didn't turn into
anything, I noticed that we often waste bytes when using wait bit
operations, so I set out to switch to another mechanism that would
allow us to free up those bytes. This is an attempt to turn i_state from
an unsigned long into a u32, using the individual bytes of i_state as
addresses for the wait var event mechanism (thanks to Linus for that idea).
This survives LTP, xfstests on various filesystems, and will-it-scale.
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Actually send out the correct branch.
- Link to v1: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org
---
---
base-commit: 01e603fb789c75b3a0c63bddd42a42a710da7a52
change-id: 20240820-work-i_state-4e34db39bcf8
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-21 22:12 ` NeilBrown
2024-08-21 22:28 ` [PATCH RFC v2 1/6] fs: add i_state helpers Dave Chinner
2024-08-21 15:47 ` [PATCH RFC v2 2/6] fs: reorder i_state bits Christian Brauner
` (6 subsequent siblings)
7 siblings, 2 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
The i_state member is an unsigned long so that it can be used with the
wait bit infrastructure, which expects an unsigned long. This wastes 4 bytes
we're unlikely to ever use. Switch to the var event wait mechanism, using
the address of the bit within i_state as the wait key. Thanks to Linus for
the address idea.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/inode.c | 10 ++++++++++
include/linux/fs.h | 16 ++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 154f8689457f..f2a2f6351ec3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
inode->i_state |= I_REFERENCED;
}
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+ struct inode *inode, u32 bit)
+{
+ void *bit_address;
+
+ bit_address = inode_state_wait_address(inode, bit);
+ init_wait_var_entry(wqe, bit_address, 0);
+ return __var_waitqueue(bit_address);
+}
+
/*
* Add inode to LRU if needed (inode is unused and clean).
*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23e7d46b818a..a5b036714d74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,22 @@ struct inode {
void *i_private; /* fs or device private pointer */
} __randomize_layout;
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+ struct inode *inode, u32 bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
+{
+ /* Ensure @bit will be seen cleared/set when caller is woken up. */
+ smp_mb();
+ wake_up_var(inode_state_wait_address(inode, bit));
+}
+
struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
static inline unsigned int i_blocksize(const struct inode *node)
--
2.43.0
* [PATCH RFC v2 2/6] fs: reorder i_state bits
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 3/6] writeback: port __I_SYNC to var event Christian Brauner
` (5 subsequent siblings)
7 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
so that we can use the first bits to derive unique addresses from
i_state.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/fs.h | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a5b036714d74..8525f8bfd7b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2418,28 +2418,32 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* i_count.
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
+ *
+ * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
+ * upon. There's one free address left.
*/
-#define I_DIRTY_SYNC (1 << 0)
-#define I_DIRTY_DATASYNC (1 << 1)
-#define I_DIRTY_PAGES (1 << 2)
-#define __I_NEW 3
+#define __I_NEW 0
#define I_NEW (1 << __I_NEW)
-#define I_WILL_FREE (1 << 4)
-#define I_FREEING (1 << 5)
-#define I_CLEAR (1 << 6)
-#define __I_SYNC 7
+#define __I_SYNC 1
#define I_SYNC (1 << __I_SYNC)
-#define I_REFERENCED (1 << 8)
+#define __I_LRU_ISOLATING 2
+#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
+
+#define I_DIRTY_SYNC (1 << 3)
+#define I_DIRTY_DATASYNC (1 << 4)
+#define I_DIRTY_PAGES (1 << 5)
+#define I_WILL_FREE (1 << 6)
+#define I_FREEING (1 << 7)
+#define I_CLEAR (1 << 8)
+#define I_REFERENCED (1 << 9)
#define I_LINKABLE (1 << 10)
#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 13)
-#define I_OVL_INUSE (1 << 14)
-#define I_CREATING (1 << 15)
-#define I_DONTCACHE (1 << 16)
-#define I_SYNC_QUEUED (1 << 17)
-#define I_PINNING_NETFS_WB (1 << 18)
-#define __I_LRU_ISOLATING 19
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
+#define I_WB_SWITCH (1 << 12)
+#define I_OVL_INUSE (1 << 13)
+#define I_CREATING (1 << 14)
+#define I_DONTCACHE (1 << 15)
+#define I_SYNC_QUEUED (1 << 16)
+#define I_PINNING_NETFS_WB (1 << 17)
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
--
2.43.0
* [PATCH RFC v2 3/6] writeback: port __I_SYNC to var event
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 2/6] fs: reorder i_state bits Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 4/6] inode: port __I_NEW " Christian Brauner
` (4 subsequent siblings)
7 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
Port the __I_SYNC mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/fs-writeback.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a5006329f6f..3619c86273a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1389,9 +1389,7 @@ static void inode_sync_complete(struct inode *inode)
inode->i_state &= ~I_SYNC;
/* If inode is clean an unused, put it into LRU now... */
inode_add_lru(inode);
- /* Waiters must see I_SYNC cleared before being woken up */
- smp_mb();
- wake_up_bit(&inode->i_state, __I_SYNC);
+ inode_wake_up_bit(inode, __I_SYNC);
}
static bool inode_dirtied_after(struct inode *inode, unsigned long t)
@@ -1512,17 +1510,21 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
*/
void inode_wait_for_writeback(struct inode *inode)
{
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
- wait_queue_head_t *wqh;
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
- lockdep_assert_held(&inode->i_lock);
- wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
- spin_unlock(&inode->i_lock);
- __wait_on_bit(wqh, &wq, bit_wait,
- TASK_UNINTERRUPTIBLE);
- spin_lock(&inode->i_lock);
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+ for (;;) {
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+ if (inode->i_state & I_SYNC) {
+ spin_unlock(&inode->i_lock);
+ schedule();
+ spin_lock(&inode->i_lock);
+ continue;
+ }
+ break;
}
+ finish_wait(wq_head, &wqe.wq_entry);
}
/*
@@ -1533,16 +1535,17 @@ void inode_wait_for_writeback(struct inode *inode)
static void inode_sleep_on_writeback(struct inode *inode)
__releases(inode->i_lock)
{
- DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- int sleep;
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
+ bool sleep;
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- sleep = inode->i_state & I_SYNC;
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+ sleep = !!(inode->i_state & I_SYNC);
spin_unlock(&inode->i_lock);
if (sleep)
schedule();
- finish_wait(wqh, &wait);
+ finish_wait(wq_head, &wqe.wq_entry);
}
/*
--
2.43.0
* [PATCH RFC v2 4/6] inode: port __I_NEW to var event
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
` (2 preceding siblings ...)
2024-08-21 15:47 ` [PATCH RFC v2 3/6] writeback: port __I_SYNC to var event Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-23 0:31 ` NeilBrown
2024-08-21 15:47 ` [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
` (3 subsequent siblings)
7 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
Port the __I_NEW mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/bcachefs/fs.c | 10 ++++++----
fs/dcache.c | 3 +--
fs/inode.c | 18 ++++++++----------
include/linux/writeback.h | 3 ++-
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 94c392abef65..c0900c0c0f8a 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
break;
}
} else if (clean_pass && this_pass_clean) {
- wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
- DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
- prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+ wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry,
+ TASK_UNINTERRUPTIBLE);
mutex_unlock(&c->vfs_inodes_lock);
schedule();
- finish_wait(wq, &wait.wq_entry);
+ finish_wait(wq_head, &wqe.wq_entry);
goto again;
}
}
diff --git a/fs/dcache.c b/fs/dcache.c
index 1af75fa68638..7037f9312ed4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1908,8 +1908,7 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
__d_instantiate(entry, inode);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
- smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_instantiate_new);
diff --git a/fs/inode.c b/fs/inode.c
index f2a2f6351ec3..d18e1567c487 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -733,7 +733,7 @@ static void evict(struct inode *inode)
* used as an indicator whether blocking on it is safe.
*/
spin_lock(&inode->i_lock);
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
spin_unlock(&inode->i_lock);
@@ -1141,8 +1141,7 @@ void unlock_new_inode(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
- smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(unlock_new_inode);
@@ -1153,8 +1152,7 @@ void discard_new_inode(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW;
- smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
iput(inode);
}
@@ -2343,8 +2341,8 @@ EXPORT_SYMBOL(inode_needs_sync);
*/
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
{
- wait_queue_head_t *wq;
- DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
/*
* Handle racing against evict(), see that routine for more details.
@@ -2355,14 +2353,14 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock
return;
}
- wq = bit_waitqueue(&inode->i_state, __I_NEW);
- prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
if (is_inode_hash_locked)
spin_unlock(&inode_hash_lock);
schedule();
- finish_wait(wq, &wait.wq_entry);
+ finish_wait(wq_head, &wqe.wq_entry);
if (is_inode_hash_locked)
spin_lock(&inode_hash_lock);
rcu_read_lock();
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 56b85841ae4c..bed795b8340b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -200,7 +200,8 @@ void inode_io_list_del(struct inode *inode);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
{
- wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
+ wait_var_event(inode_state_wait_address(inode, __I_NEW),
+ !(inode->i_state & I_NEW));
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.43.0
* [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
` (3 preceding siblings ...)
2024-08-21 15:47 ` [PATCH RFC v2 4/6] inode: port __I_NEW " Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-21 19:41 ` Jeff Layton
2024-08-23 0:36 ` NeilBrown
2024-08-21 15:47 ` [PATCH RFC v2 6/6] inode: make i_state a u32 Christian Brauner
` (2 subsequent siblings)
7 siblings, 2 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/inode.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d18e1567c487..c8a5c63dc980 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -510,8 +510,7 @@ static void inode_unpin_lru_isolating(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
inode->i_state &= ~I_LRU_ISOLATING;
- smp_mb();
- wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
+ inode_wake_up_bit(inode, __I_LRU_ISOLATING);
spin_unlock(&inode->i_lock);
}
@@ -519,13 +518,22 @@ static void inode_wait_for_lru_isolating(struct inode *inode)
{
lockdep_assert_held(&inode->i_lock);
if (inode->i_state & I_LRU_ISOLATING) {
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
- wait_queue_head_t *wqh;
-
- wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
- spin_unlock(&inode->i_lock);
- __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
- spin_lock(&inode->i_lock);
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
+
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
+ for (;;) {
+ prepare_to_wait_event(wq_head, &wqe.wq_entry,
+ TASK_UNINTERRUPTIBLE);
+ if (inode->i_state & I_LRU_ISOLATING) {
+ spin_unlock(&inode->i_lock);
+ schedule();
+ spin_lock(&inode->i_lock);
+ continue;
+ }
+ break;
+ }
+ finish_wait(wq_head, &wqe.wq_entry);
WARN_ON(inode->i_state & I_LRU_ISOLATING);
}
}
--
2.43.0
* [PATCH RFC v2 6/6] inode: make i_state a u32
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
` (4 preceding siblings ...)
2024-08-21 15:47 ` [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
@ 2024-08-21 15:47 ` Christian Brauner
2024-08-21 21:03 ` Andreas Dilger
2024-08-21 19:15 ` [PATCH RFC v2 0/6] inode: turn i_state into u32 Josef Bacik
2024-08-21 19:42 ` Jeff Layton
7 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-21 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
Christian Brauner, linux-fsdevel
Now that we use the wait var event mechanism, make i_state a u32 and free
up 4 bytes. This means we currently have two 4-byte holes in struct
inode which we can pack.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8525f8bfd7b9..a673173b6896 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -681,7 +681,7 @@ struct inode {
#endif
/* Misc */
- unsigned long i_state;
+ u32 i_state;
struct rw_semaphore i_rwsem;
unsigned long dirtied_when; /* jiffies of first dirtying */
--
2.43.0
* Re: [PATCH RFC v2 0/6] inode: turn i_state into u32
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
` (5 preceding siblings ...)
2024-08-21 15:47 ` [PATCH RFC v2 6/6] inode: make i_state a u32 Christian Brauner
@ 2024-08-21 19:15 ` Josef Bacik
2024-08-21 19:42 ` Jeff Layton
7 siblings, 0 replies; 49+ messages in thread
From: Josef Bacik @ 2024-08-21 19:15 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Wed, Aug 21, 2024 at 05:47:30PM +0200, Christian Brauner wrote:
> Hey,
>
> So first time I managed to send out my personal wip branch.
> So for the record: The intention is to send what's in work.i_state on
> vfs/vfs.git.
>
> ---
> I've recently looked for some free space in struct inode again because
> of some exec kerfuffle we recently had and while my idea didn't turn
> into anything I noticed that we often waste bytes when using wait bit
> operations. So I set out to switch that to another mechanism that would
> allow us to free up bytes. So this is an attempt to turn i_state from
> an unsigned long into an u32 using the individual bytes of i_state as
> addresses for the wait var event mechanism (Thanks to Linus for that idea.).
>
> This survives LTP, xfstests on various filesystems, and will-it-scale.
>
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
>
> ---
> Changes in v2:
> - Actually send out the correct branch.
> - Link to v1: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org
>
You can add
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
to the series, thanks,
Josef
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-21 15:47 ` [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
@ 2024-08-21 19:41 ` Jeff Layton
2024-08-22 8:53 ` Christian Brauner
2024-08-23 0:36 ` NeilBrown
1 sibling, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2024-08-21 19:41 UTC (permalink / raw)
To: Christian Brauner, Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jan Kara, linux-fsdevel
On Wed, 2024-08-21 at 17:47 +0200, Christian Brauner wrote:
> Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/inode.c | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index d18e1567c487..c8a5c63dc980 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -510,8 +510,7 @@ static void inode_unpin_lru_isolating(struct inode *inode)
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
> inode->i_state &= ~I_LRU_ISOLATING;
> - smp_mb();
> - wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
> + inode_wake_up_bit(inode, __I_LRU_ISOLATING);
> spin_unlock(&inode->i_lock);
> }
>
> @@ -519,13 +518,22 @@ static void inode_wait_for_lru_isolating(struct inode *inode)
> {
> lockdep_assert_held(&inode->i_lock);
> if (inode->i_state & I_LRU_ISOLATING) {
> - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> - wait_queue_head_t *wqh;
> -
> - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> - spin_unlock(&inode->i_lock);
> - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> - spin_lock(&inode->i_lock);
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
> +
> + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> + for (;;) {
> + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> + TASK_UNINTERRUPTIBLE);
> + if (inode->i_state & I_LRU_ISOLATING) {
> + spin_unlock(&inode->i_lock);
> + schedule();
> + spin_lock(&inode->i_lock);
> + continue;
> + }
> + break;
> + }
nit: personally, I'd prefer this, since you wouldn't need the brackets
or the continue:
if (!(inode->i_state & I_LRU_ISOLATING))
break;
spin_unlock();
schedule();
...
> + finish_wait(wq_head, &wqe.wq_entry);
> WARN_ON(inode->i_state & I_LRU_ISOLATING);
> }
> }
>
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH RFC v2 0/6] inode: turn i_state into u32
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
` (6 preceding siblings ...)
2024-08-21 19:15 ` [PATCH RFC v2 0/6] inode: turn i_state into u32 Josef Bacik
@ 2024-08-21 19:42 ` Jeff Layton
7 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2024-08-21 19:42 UTC (permalink / raw)
To: Christian Brauner, Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jan Kara, linux-fsdevel
On Wed, 2024-08-21 at 17:47 +0200, Christian Brauner wrote:
> Hey,
>
> So first time I managed to send out my personal wip branch.
> So for the record: The intention is to send what's in work.i_state on
> vfs/vfs.git.
>
> ---
> I've recently looked for some free space in struct inode again
> because
> of some exec kerfuffle we recently had and while my idea didn't turn
> into anything I noticed that we often waste bytes when using wait bit
> operations. So I set out to switch that to another mechanism that
> would
> allow us to free up bytes. So this is an attempt to turn i_state from
> an unsigned long into an u32 using the individual bytes of i_state as
> addresses for the wait var event mechanism (Thanks to Linus for that
> idea.).
>
> This survives LTP, xfstests on various filesystems, and will-it-
> scale.
>
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
>
> ---
> Changes in v2:
> - Actually send out the correct branch.
> - Link to v1:
> https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org
>
> ---
>
>
>
> ---
> base-commit: 01e603fb789c75b3a0c63bddd42a42a710da7a52
> change-id: 20240820-work-i_state-4e34db39bcf8
>
This all looks good to me, modulo one minor nit in patch #5 that you
can take or leave.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH RFC v2 6/6] inode: make i_state a u32
2024-08-21 15:47 ` [PATCH RFC v2 6/6] inode: make i_state a u32 Christian Brauner
@ 2024-08-21 21:03 ` Andreas Dilger
2024-08-22 8:31 ` Christian Brauner
0 siblings, 1 reply; 49+ messages in thread
From: Andreas Dilger @ 2024-08-21 21:03 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Aug 21, 2024, at 9:47 AM, Christian Brauner <brauner@kernel.org> wrote:
>
> Now that we use the wait var event mechanism make i_state a u32 and free
> up 4 bytes. This means we currently have two 4 byte holes in struct
> inode which we can pack.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> include/linux/fs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8525f8bfd7b9..a673173b6896 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -681,7 +681,7 @@ struct inode {
> #endif
>
> /* Misc */
> - unsigned long i_state;
> + u32 i_state;
Is it worthwhile to add a comment that there is a hole here, instead
of leaving it for future re-discovery?
/* 32-bit hole */
> struct rw_semaphore i_rwsem;
>
> unsigned long dirtied_when; /* jiffies of first dirtying */
Normally this would be excess noise, but since struct inode is
micro-optimized (like struct page) it probably makes sense to
document this case.
Cheers, Andreas
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
@ 2024-08-21 22:12 ` NeilBrown
2024-08-21 22:47 ` Linus Torvalds
` (2 more replies)
2024-08-21 22:28 ` [PATCH RFC v2 1/6] fs: add i_state helpers Dave Chinner
1 sibling, 3 replies; 49+ messages in thread
From: NeilBrown @ 2024-08-21 22:12 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, Christian Brauner, linux-fsdevel
On Thu, 22 Aug 2024, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/inode.c | 10 ++++++++++
> include/linux/fs.h | 16 ++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 154f8689457f..f2a2f6351ec3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> inode->i_state |= I_REFERENCED;
> }
>
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit)
> +{
> + void *bit_address;
> +
> + bit_address = inode_state_wait_address(inode, bit);
> + init_wait_var_entry(wqe, bit_address, 0);
> + return __var_waitqueue(bit_address);
> +}
> +
> /*
> * Add inode to LRU if needed (inode is unused and clean).
> *
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23e7d46b818a..a5b036714d74 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -744,6 +744,22 @@ struct inode {
> void *i_private; /* fs or device private pointer */
> } __randomize_layout;
>
> +/*
> + * Get bit address from inode->i_state to use with wait_var_event()
> + * infrastructure.
> + */
> +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> +
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit);
> +
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> + /* Ensure @bit will be seen cleared/set when caller is woken up. */
The above comment is wrong. I think I once thought it was correct too
but now I know better (I hope).
A better comment might be
/* Insert memory barrier as recommended by wake_up_var() */
but even that is unnecessary as we don't need the memory barrier.
A careful reading of memory-barriers.rst shows that *when the process is
actually woken* there are sufficient barriers in wake_up_process() and
prepare_to_wait_event() and the scheduler and (particularly)
set_current_state() so that a value set before the wake_up is seen after
the schedule().
So this barrier isn't about the bit. This barrier is about the
wait_queue. In particular it is about waitqueue_active() call at the
start of wake_up_var(). If that test wasn't there and if instead
wake_up_var() conditionally called __wake_up(), then there would be no
need for any barrier. A comment near wake_up_bit() makes it clear that
the barrier is only needed when the spinlock is avoided.
On a weakly ordered arch, this test can take effect *before* the write
of the bit. If the waiter adds itself to the wait queue and then tests
the bit before the bit is set, but after the waitqueue_active() test is
put in effect, then the wake_up will never be sent.
But .... this is all academic of this code because you don't need a
barrier at all. The wake_up happens in a spin_locked region, and the
wait is entirely inside the same spin_lock, except for the schedule. A
later patch has:
spin_unlock(); schedule(); spin_lock();
So there is no room for a race. If the bit is cleared before the
wait_var_event() equivalent, then no wakeup is needed. When the lock is
dropped after the bit is cleared the unlock will have all the barriers
needed for the bit to be visible.
The only other place that the bit can be cleared is concurrent with the
above schedule() while the spinlock isn't held by the waiter. In that
case it is again clear that no barrier is needed - or that the
spin_unlock/lock provide all the needed barriers.
So a better comment would be
/* no barrier needed as both waker and waiter are in spin-locked regions */
Thanks,
NeilBrown
> + smp_mb();
> + wake_up_var(inode_state_wait_address(inode, bit));
> +}
> +
> struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
>
> static inline unsigned int i_blocksize(const struct inode *node)
>
> --
> 2.43.0
>
>
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
2024-08-21 22:12 ` NeilBrown
@ 2024-08-21 22:28 ` Dave Chinner
2024-08-21 22:53 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2024-08-21 22:28 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Wed, Aug 21, 2024 at 05:47:31PM +0200, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/inode.c | 10 ++++++++++
> include/linux/fs.h | 16 ++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 154f8689457f..f2a2f6351ec3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> inode->i_state |= I_REFERENCED;
> }
>
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit)
> +{
> + void *bit_address;
> +
> + bit_address = inode_state_wait_address(inode, bit);
> + init_wait_var_entry(wqe, bit_address, 0);
> + return __var_waitqueue(bit_address);
> +}
> +
> /*
> * Add inode to LRU if needed (inode is unused and clean).
> *
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23e7d46b818a..a5b036714d74 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -744,6 +744,22 @@ struct inode {
> void *i_private; /* fs or device private pointer */
> } __randomize_layout;
>
> +/*
> + * Get bit address from inode->i_state to use with wait_var_event()
> + * infrastructre.
> + */
> +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> +
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit);
> +
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> + /* Ensure @bit will be seen cleared/set when caller is woken up. */
> + smp_mb();
> + wake_up_var(inode_state_wait_address(inode, bit));
> +}
Why is this memory barrier necessary?
wakeup_var() takes the wait queue head spinlock, as does
prepare_to_wait() before the waiter condition check and the
subsequent finish_wait() when the wait is done.
Hence by the time the functions like inode_wait_for_writeback()
return to the caller, there's been a full unlock->lock cycle on the
wait queue head lock between the bit being set and the waiter waking
and checking it.
ie. the use of the generic waitqueue infrastructure implies a full
memory barrier between the bit being set, the wakeup being issued
and the waiter checking the condition bit.
Hence I'm not sure what race condition this memory barrier is
preventing - is this just a historical artifact that predates the
current wait/wake implementation?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 22:12 ` NeilBrown
@ 2024-08-21 22:47 ` Linus Torvalds
2024-08-21 23:34 ` Dave Chinner
2024-08-22 8:27 ` Christian Brauner
2 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2024-08-21 22:47 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Thu, 22 Aug 2024 at 06:12, NeilBrown <neilb@suse.de> wrote:
>
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about waitqueue_active() call
Correct. The waitqueue_active() call is basically very racy, but it's
also often a huge deal for performance.
In *most* cases, wakeups happen with nobody waiting for them at all,
and taking the waitqueue spinlock just to see that there are no
waiters is hugely expensive and easily causes lots of pointless cache
ping-pong.
So the waitqueue_active() is there to basically say "if there are no
waiters, don't bother doing anything".
But *because* it is done - by definition - without holding the
waitqueue lock, it means that a new waiter might have just checked for
the condition, and is about to add itself to the wait queue.
Now, waiters will then re-check *after* they have added themselves to
the waitqueue too (that's what all the wait_for_event etc stuff does),
and the waiter side will have barriers thanks to the spinlocks on the
waitqueue.
But the waker still needs to make sure it's ordered wrt "I changed the
condition that the waiter is waiting for" and the lockless
waitqueue_active() test.
If waiters and wakers are mutually serialized by some external lock,
there are no races.
But commonly, the waker will just do an atomic op, or maybe it holds a
lock (as a writer) that the waiting side does not hold, and the
waiting side just does a "READ_ONCE()" or a "smp_load_acquire()" or
whatever.
So the waker needs to make sure that its state setting is serialized
with the waiter. If it uses our atomics, then typically a
smp_mb__after_atomic() is appropriate (which often ends up being a
no-op).
But commonly you end up needing a smp_mb(), which serializes whatever
action the waker did with the waker then doing that optimistic
lockless waitqueue_active().
> /* no barrier needs as both waker and waiter are in spin-locked regions */
If all the inode state waiters do indeed hold the lock, then that is fine.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 22:28 ` [PATCH RFC v2 1/6] fs: add i_state helpers Dave Chinner
@ 2024-08-21 22:53 ` Linus Torvalds
0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2024-08-21 22:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Thu, 22 Aug 2024 at 06:28, Dave Chinner <david@fromorbit.com> wrote:
>
> wakeup_var() takes the wait queue head spinlock
That's the part you missed.
wake_up_var() does no such thing. It calls __wake_up_bit(), which does this:
if (waitqueue_active(wq_head))
__wake_up(wq_head, TASK_NORMAL, 1, &key);
and that waitqueue_active() is lockless.
That's the thing that wants the barrier (or something else that
guarantees that the waiters actually *see* the new state after they've
added themselves to the wait-list, and before we then look at the
wait-list being empty).
See the comment in <linux/wait.h> above the waitqueue_active().
And yes, this is annoying and subtle and has caused bugs. But taking
waitqueue locks just to notice nobody is waiting is very expensive
indeed, and much more expensive than a memory barrier when the common
case is typically often "nobody is waiting".
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 22:12 ` NeilBrown
2024-08-21 22:47 ` Linus Torvalds
@ 2024-08-21 23:34 ` Dave Chinner
2024-08-23 0:08 ` NeilBrown
2024-08-22 8:27 ` Christian Brauner
2 siblings, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2024-08-21 23:34 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Linus Torvalds, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Thu, Aug 22, 2024 at 08:12:15AM +1000, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/inode.c | 10 ++++++++++
> > include/linux/fs.h | 16 ++++++++++++++++
> > 2 files changed, 26 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> > inode->i_state |= I_REFERENCED;
> > }
> >
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > + struct inode *inode, u32 bit)
> > +{
> > + void *bit_address;
> > +
> > + bit_address = inode_state_wait_address(inode, bit);
> > + init_wait_var_entry(wqe, bit_address, 0);
> > + return __var_waitqueue(bit_address);
> > +}
> > +
> > /*
> > * Add inode to LRU if needed (inode is unused and clean).
> > *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> > void *i_private; /* fs or device private pointer */
> > } __randomize_layout;
> >
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructre.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > + struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > + /* Ensure @bit will be seen cleared/set when caller is woken up. */
>
> The above comment is wrong. I think I once thought it was correct too
> but now I know better (I hope).
> A better comment might be
> /* Insert memory barrier as recommended by wake_up_var() */
> but even that is unnecessary as we don't need the memory barrier.
>
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
>
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about waitqueue_active() call at the
> start of wake_up_var(). If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no
> need for any barrier. A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.
Oh, I missed that *landmine*.
So almost none of the current uses of wake_up_var() have an explicit
memory barrier before them, and a lot of them are not done under
spin locks. I suspect that the implicit store ordering of the wake
queue check is mostly only handled by chance - an
atomic_dec_and_test() or smp_store_release() call is made before
wake_up_var() or the wait/wakeup is synchronised by an external
lock.
> On a weakly ordered arch, this test can be effected *before* the write
> of the bit. If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
>
> But .... this is all academic of this code because you don't need a
> barrier at all. The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule. A
> later patch has:
> spin_unlock(); schedule(); spin_lock();
>
> So there is no room for a race. If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed. When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.
Right, that's exactly the problem with infrastructure that
externalises memory ordering dependencies. Most of the time they
just work because of external factors, but sometimes they don't and
we get random memory barriers being cargo culted around the place
because that seems to fix weird problems....
> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter. In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
>
> So a better comment would be
>
> /* no barrier needs as both waker and waiter are in spin-locked regions */
^^^^^^^
ordering
Even better would be to fix wake_up_var() to not have an implicit
ordering requirement. Add __wake_up_var() as the "I know what I'm
doing" API, and have wake_up_var() always issue the memory barrier
like so:
__wake_up_var(var)
{
__wake_up_bit(....);
}
wake_up_var(var)
{
smp_mb();
__wake_up_var(var);
}
Then people who don't intimately understand ordering (i.e. just want
to use a conditional wait variable) or just don't want to
think about complex memory ordering issues for otherwise obvious and
simple code can simply use the safe variant.
Those that need to optimise for speed or know they have solid
ordering or external serialisation can use __wake_up_var() and leave
a comment as to why that is safe. i.e.:
/*
* No wait/wakeup ordering required as both waker and waiters
* are serialised by the inode->i_lock.
*/
__wake_up_var(....);
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 22:12 ` NeilBrown
2024-08-21 22:47 ` Linus Torvalds
2024-08-21 23:34 ` Dave Chinner
@ 2024-08-22 8:27 ` Christian Brauner
2024-08-22 8:37 ` Linus Torvalds
2024-08-23 0:14 ` NeilBrown
2 siblings, 2 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-22 8:27 UTC (permalink / raw)
To: NeilBrown
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Thu, Aug 22, 2024 at 08:12:15AM GMT, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/inode.c | 10 ++++++++++
> > include/linux/fs.h | 16 ++++++++++++++++
> > 2 files changed, 26 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> > inode->i_state |= I_REFERENCED;
> > }
> >
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > + struct inode *inode, u32 bit)
> > +{
> > + void *bit_address;
> > +
> > + bit_address = inode_state_wait_address(inode, bit);
> > + init_wait_var_entry(wqe, bit_address, 0);
> > + return __var_waitqueue(bit_address);
> > +}
> > +
> > /*
> > * Add inode to LRU if needed (inode is unused and clean).
> > *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> > void *i_private; /* fs or device private pointer */
> > } __randomize_layout;
> >
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructre.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > + struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > + /* Ensure @bit will be seen cleared/set when caller is woken up. */
>
> The above comment is wrong. I think I once thought it was correct too
> but now I know better (I hope).
> A better comment might be
> /* Insert memory barrier as recommended by wake_up_var() */
> but even that is unnecessary as we don't need the memory barrier.
>
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
>
> So this barrier isn't about the bit. This barrier is about the
> wait_queue. In particular it is about waitqueue_active() call at the
> start of wake_up_var(). If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no
Did you mean "unconditionally called"?
> need for any barrier. A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.
>
> On a weakly ordered arch, this test can be effected *before* the write
> of the bit. If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
>
> But .... this is all academic of this code because you don't need a
> barrier at all. The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule. A
> later patch has:
> spin_unlock(); schedule(); spin_lock();
>
> So there is no room for a race. If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed. When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.
>
> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter. In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
>
> So a better comment would be
>
> /* no barrier needs as both waker and waiter are in spin-locked regions */
Thanks for the analysis. I was under the impression that wait_on_inode()
was called in contexts where no barrier is guaranteed and the bit isn't
checked with spin_lock() held.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 6/6] inode: make i_state a u32
2024-08-21 21:03 ` Andreas Dilger
@ 2024-08-22 8:31 ` Christian Brauner
0 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-22 8:31 UTC (permalink / raw)
To: Andreas Dilger
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Wed, Aug 21, 2024 at 03:03:01PM GMT, Andreas Dilger wrote:
> On Aug 21, 2024, at 9:47 AM, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Now that we use the wait var event mechanism make i_state a u32 and free
> > up 4 bytes. This means we currently have two 4 byte holes in struct
> > inode which we can pack.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > include/linux/fs.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8525f8bfd7b9..a673173b6896 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -681,7 +681,7 @@ struct inode {
> > #endif
> >
> > /* Misc */
> > - unsigned long i_state;
> > + u32 i_state;
>
> Is it worthwhile to add a comment that there is a hole here, instead
> of leaving it for future re-discovery?
>
> /* 32-bit hole */
>
> > struct rw_semaphore i_rwsem;
> >
> > unsigned long dirtied_when; /* jiffies of first dirtying */
>
> Normally this would be excess noise, but since struct inode is
> micro-optimized (like struct page) it probably makes sense to
> document this case.
Good idea. Added now!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-22 8:27 ` Christian Brauner
@ 2024-08-22 8:37 ` Linus Torvalds
2024-08-23 0:14 ` NeilBrown
1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2024-08-22 8:37 UTC (permalink / raw)
To: Christian Brauner
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
linux-fsdevel
On Thu, 22 Aug 2024 at 16:27, Christian Brauner <brauner@kernel.org> wrote:
>
> >
> > /* no barrier needs as both waker and waiter are in spin-locked regions */
>
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.
Yes, it does look like the barrier is needed, because the waiter does
not hold the lock indeed.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-21 19:41 ` Jeff Layton
@ 2024-08-22 8:53 ` Christian Brauner
2024-08-22 9:48 ` Mateusz Guzik
0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-22 8:53 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar, Jan Kara,
linux-fsdevel
On Wed, Aug 21, 2024 at 03:41:45PM GMT, Jeff Layton wrote:
> On Wed, 2024-08-21 at 17:47 +0200, Christian Brauner wrote:
> > Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/inode.c | 26 +++++++++++++++++---------
> > 1 file changed, 17 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index d18e1567c487..c8a5c63dc980 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -510,8 +510,7 @@ static void inode_unpin_lru_isolating(struct inode *inode)
> > spin_lock(&inode->i_lock);
> > WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
> > inode->i_state &= ~I_LRU_ISOLATING;
> > - smp_mb();
> > - wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
> > + inode_wake_up_bit(inode, __I_LRU_ISOLATING);
> > spin_unlock(&inode->i_lock);
> > }
> >
> > @@ -519,13 +518,22 @@ static void inode_wait_for_lru_isolating(struct inode *inode)
> > {
> > lockdep_assert_held(&inode->i_lock);
> > if (inode->i_state & I_LRU_ISOLATING) {
> > - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> > - wait_queue_head_t *wqh;
> > -
> > - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> > - spin_unlock(&inode->i_lock);
> > - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> > - spin_lock(&inode->i_lock);
> > + struct wait_bit_queue_entry wqe;
> > + struct wait_queue_head *wq_head;
> > +
> > + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> > + for (;;) {
> > + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> > + TASK_UNINTERRUPTIBLE);
> > + if (inode->i_state & I_LRU_ISOLATING) {
> > + spin_unlock(&inode->i_lock);
> > + schedule();
> > + spin_lock(&inode->i_lock);
> > + continue;
> > + }
> > + break;
> > + }
>
> nit: personally, I'd prefer this, since you wouldn't need the brackets
> or the continue:
>
> if (!(inode->i_state & LRU_ISOLATING))
> break;
> spin_unlock();
> schedule();
Yeah, that's nicer. I'll use that.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-22 8:53 ` Christian Brauner
@ 2024-08-22 9:48 ` Mateusz Guzik
2024-08-22 11:10 ` Christian Brauner
0 siblings, 1 reply; 49+ messages in thread
From: Mateusz Guzik @ 2024-08-22 9:48 UTC (permalink / raw)
To: Christian Brauner
Cc: Jeff Layton, Linus Torvalds, NeilBrown, Peter Zijlstra,
Ingo Molnar, Jan Kara, linux-fsdevel
On Thu, Aug 22, 2024 at 10:53:50AM +0200, Christian Brauner wrote:
> On Wed, Aug 21, 2024 at 03:41:45PM GMT, Jeff Layton wrote:
> > > if (inode->i_state & I_LRU_ISOLATING) {
> > > - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> > > - wait_queue_head_t *wqh;
> > > -
> > > - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> > > - spin_unlock(&inode->i_lock);
> > > - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> > > - spin_lock(&inode->i_lock);
> > > + struct wait_bit_queue_entry wqe;
> > > + struct wait_queue_head *wq_head;
> > > +
> > > + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> > > + for (;;) {
> > > + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> > > + TASK_UNINTERRUPTIBLE);
> > > + if (inode->i_state & I_LRU_ISOLATING) {
> > > + spin_unlock(&inode->i_lock);
> > > + schedule();
> > > + spin_lock(&inode->i_lock);
> > > + continue;
> > > + }
> > > + break;
> > > + }
> >
> > nit: personally, I'd prefer this, since you wouldn't need the brackets
> > or the continue:
> >
> > if (!(inode->i_state & LRU_ISOLATING))
> > break;
> > spin_unlock();
> > schedule();
>
> Yeah, that's nicer. I'll use that.
In that case may I suggest also short-circuiting the entire func?
if (!(inode->i_state & I_LRU_ISOLATING))
return;
then the entire thing loses one indentation level
Same thing is applicable to the I_SYNC flag.
A non-cosmetic question is as follows: is it legal for a wakeup to happen
spuriously?
For legal waiters on the flag I_FREEING is set which prevents
I_LRU_ISOLATING from showing up afterwards. Thus it should be sufficient
to wait for the flag to clear once. This is moot if spurious wakeups do
happen.
If looping is needed, then this warn:
WARN_ON(inode->i_state & I_LRU_ISOLATING);
fails to test for anything since the loop is on that very condition (iow
it needs to be whacked)
The writeback code needs to loop because there are callers outside of
evict().
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-22 9:48 ` Mateusz Guzik
@ 2024-08-22 11:10 ` Christian Brauner
2024-08-22 12:46 ` Mateusz Guzik
0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-22 11:10 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jeff Layton, Linus Torvalds, NeilBrown, Peter Zijlstra,
Ingo Molnar, Jan Kara, linux-fsdevel
On Thu, Aug 22, 2024 at 11:48:45AM GMT, Mateusz Guzik wrote:
> On Thu, Aug 22, 2024 at 10:53:50AM +0200, Christian Brauner wrote:
> > On Wed, Aug 21, 2024 at 03:41:45PM GMT, Jeff Layton wrote:
> > > > if (inode->i_state & I_LRU_ISOLATING) {
> > > > - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> > > > - wait_queue_head_t *wqh;
> > > > -
> > > > - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> > > > - spin_unlock(&inode->i_lock);
> > > > - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> > > > - spin_lock(&inode->i_lock);
> > > > + struct wait_bit_queue_entry wqe;
> > > > + struct wait_queue_head *wq_head;
> > > > +
> > > > + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> > > > + for (;;) {
> > > > + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> > > > + TASK_UNINTERRUPTIBLE);
> > > > + if (inode->i_state & I_LRU_ISOLATING) {
> > > > + spin_unlock(&inode->i_lock);
> > > > + schedule();
> > > > + spin_lock(&inode->i_lock);
> > > > + continue;
> > > > + }
> > > > + break;
> > > > + }
> > >
> > > nit: personally, I'd prefer this, since you wouldn't need the brackets
> > > or the continue:
> > >
> > > if (!(inode->i_state & LRU_ISOLATING))
> > > break;
> > > spin_unlock();
> > > schedule();
> >
> > Yeah, that's nicer. I'll use that.
>
> In that case may I suggest also short-circuiting the entire func?
>
> if (!(inode->i_state & I_LRU_ISOLATING))
> return;
Afaict, if prepare_to_wait_event() has been called, finish_wait() must be
called.
>
> then the entire thing loses one indentation level
>
> Same thing is applicable to the I_SYNC flag.
>
> A non-cosmetic is as follows: is it legal for a wake up to happen
> spuriously?
So, I simply mimicked ___wait_var_event() and __wait_on_bit() - both
loop. I reasoned that ___wait_var_event() via @state and __wait_on_bit()
via @mode take the task state into account to wait in. And my conclusion
was that they don't loop on TASK_UNINTERRUPTIBLE but would loop on e.g.,
TASK_INTERRUPTIBLE. Iow, I assume that spurious wakeups shouldn't happen
with TASK_UNINTERRUPTIBLE.
But it's not something I'm able to assert with absolute confidence so I
erred on the side of caution.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-22 11:10 ` Christian Brauner
@ 2024-08-22 12:46 ` Mateusz Guzik
0 siblings, 0 replies; 49+ messages in thread
From: Mateusz Guzik @ 2024-08-22 12:46 UTC (permalink / raw)
To: Christian Brauner
Cc: Jeff Layton, Linus Torvalds, NeilBrown, Peter Zijlstra,
Ingo Molnar, Jan Kara, linux-fsdevel
On Thu, Aug 22, 2024 at 01:10:43PM +0200, Christian Brauner wrote:
> On Thu, Aug 22, 2024 at 11:48:45AM GMT, Mateusz Guzik wrote:
> > On Thu, Aug 22, 2024 at 10:53:50AM +0200, Christian Brauner wrote:
> > > On Wed, Aug 21, 2024 at 03:41:45PM GMT, Jeff Layton wrote:
> > > > > if (inode->i_state & I_LRU_ISOLATING) {
> > > > > - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> > > > > - wait_queue_head_t *wqh;
> > > > > -
> > > > > - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> > > > > - spin_unlock(&inode->i_lock);
> > > > > - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> > > > > - spin_lock(&inode->i_lock);
> > > > > + struct wait_bit_queue_entry wqe;
> > > > > + struct wait_queue_head *wq_head;
> > > > > +
> > > > > + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> > > > > + for (;;) {
> > > > > + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> > > > > + TASK_UNINTERRUPTIBLE);
> > > > > + if (inode->i_state & I_LRU_ISOLATING) {
> > > > > + spin_unlock(&inode->i_lock);
> > > > > + schedule();
> > > > > + spin_lock(&inode->i_lock);
> > > > > + continue;
> > > > > + }
> > > > > + break;
> > > > > + }
> > > >
> > > > nit: personally, I'd prefer this, since you wouldn't need the brackets
> > > > or the continue:
> > > >
> > > > if (!(inode->i_state & LRU_ISOLATING))
> > > > break;
> > > > spin_unlock();
> > > > schedule();
> > >
> > > Yeah, that's nicer. I'll use that.
> >
> > In that case may I suggest also short-circuiting the entire func?
> >
> > if (!(inode->i_state & I_LRU_ISOLATING))
> > return;
>
> Afaict, if prepare_to_wait_event() has been called finish_wait() must be
> called.
>
I mean the upfront check, before any other work:
static void inode_wait_for_lru_isolating(struct inode *inode)
{
lockdep_assert_held(&inode->i_lock);
if (!(inode->i_state & I_LRU_ISOLATING))
return;
/* for loop goes here, losing an indentation level but otherwise
* identical. same remark for the writeback thing */
}
> >
> > then the entire thing loses one indentation level
> >
> > Same thing is applicable to the I_SYNC flag.
> >
> > A non-cosmetic is as follows: is it legal for a wake up to happen
> > spuriously?
>
> So, I simply mimicked ___wait_var_event() and __wait_on_bit() - both
> loop. I reasoned that ___wait_var_event() via @state and __wait_on_bit()
> via @mode take the task state into account to wait in. And my conclusion
> was that they don't loop on TASK_UNINTERRUPTIBLE but would loop on e.g.,
> TASK_INTERRUPTIBLE. Iow, I assume that spurious wakeups shouldn't happen
> with TASK_UNINTERRUPTIBLE.
>
> But it's not something I'm able to assert with absolute confidence so I
> erred on the side of caution.
This definitely should be sorted out so that callers don't need to loop if
the condition is a one-off in the worst case, but I concede getting
there is not *needed* at the moment; it is just a fixme.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-21 23:34 ` Dave Chinner
@ 2024-08-23 0:08 ` NeilBrown
0 siblings, 0 replies; 49+ messages in thread
From: NeilBrown @ 2024-08-23 0:08 UTC (permalink / raw)
To: Dave Chinner
Cc: Christian Brauner, Linus Torvalds, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Thu, 22 Aug 2024, Dave Chinner wrote:
>
> Even better would be to fix wake_up_var() to not have an implicit
> ordering requirement. Add __wake_up_var() as the "I know what I'm
> doing" API, and have wake_up_var() always issue the memory barrier
> like so:
>
> __wake_up_var(var)
> {
> __wake_up_bit(....);
> }
>
> wake_up_var(var)
> {
> smp_mb();
> __wake_up_var(var);
> }
This is very close to what I proposed
https://lore.kernel.org/all/20240819053605.11706-9-neilb@suse.de/
but Linus doesn't like that approach.
I would prefer a collection of interfaces which combine the update to
the variable with the wake_up.
So
atomic_dec_and_wake()
store_release_and_wake()
though there are a few that don't fit into a common pattern.
Sometimes a minor code re-write can make them fit. Other times I'm not
so sure.
I'm glad I'm not the only one who thought that adding a barrier to
wake_up_var() was a good idea though - thanks.
NeilBrown
>
> Then people who don't intimately understand ordering (i.e. just want
> to use a conditional wait variable) or just don't want to
> think about complex memory ordering issues for otherwise obvious and
> simple code can simply use the safe variant.
>
> Those that need to optimise for speed or know they have solid
> ordering or external serialisation can use __wake_up_var() and leave
> a comment as to why that is safe. i.e.:
>
> /*
> * No wait/wakeup ordering required as both waker and waiters
> * are serialised by the inode->i_lock.
> */
> __wake_up_var(....);
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-22 8:27 ` Christian Brauner
2024-08-22 8:37 ` Linus Torvalds
@ 2024-08-23 0:14 ` NeilBrown
2024-08-23 2:52 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: NeilBrown @ 2024-08-23 0:14 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Thu, 22 Aug 2024, Christian Brauner wrote:
> >
> > So this barrier isn't about the bit. This barrier is about the
> > wait_queue. In particular it is about waitqueue_active() call at the
> > start of wake_up_var(). If that test wasn't there and if instead
> > wake_up_var() conditionally called __wake_up(), then there would be no
>
> Did you mean "unconditionally called"?
Yes :-)
>
> > need for any barrier. A comment near wake_up_bit() makes it clear that
> > the barrier is only needed when the spinlock is avoided.
> >
> > On a weakly ordered arch, this test can be effected *before* the write
> > of the bit. If the waiter adds itself to the wait queue and then tests
> > the bit before the bit is set, but after the waitqueue_active() test is
> > put in effect, then the wake_up will never be sent.
> >
> > But .... this is all academic of this code because you don't need a
> > barrier at all. The wake_up happens in a spin_locked region, and the
> > wait is entirely inside the same spin_lock, except for the schedule. A
> > later patch has:
> > spin_unlock(); schedule(); spin_lock();
> >
> > So there is no room for a race. If the bit is cleared before the
> > wait_var_event() equivalent, then no wakeup is needed. When the lock is
> > dropped after the bit is cleared the unlock will have all the barriers
> > needed for the bit to be visible.
> >
> > The only other place that the bit can be cleared is concurrent with the
> > above schedule() while the spinlock isn't held by the waiter. In that
> > case it is again clear that no barrier is needed - or that the
> > spin_unlock/lock provide all the needed barriers.
> >
> > So a better comment would be
> >
> > /* no barrier needs as both waker and waiter are in spin-locked regions */
>
> Thanks for the analysis. I was under the impression that wait_on_inode()
> was called in contexts where no barrier is guaranteed and the bit isn't
> checked with spin_lock() held.
>
Yes it is. I was looking at the I_SYNC waiter and mistakenly thought it
was a model for the others.
A wake_up for I_SYNC being cleared doesn't need the barrier.
Maybe inode_wake_up_bit() should not have a barrier and the various
callers should add whichever barrier is appropriate. That is the model
that Linus prefers for wake_up_bit() and for consistency it should apply
to inode_wake_up_bit() too.
Thanks,
NeilBrown
* Re: [PATCH RFC v2 4/6] inode: port __I_NEW to var event
2024-08-21 15:47 ` [PATCH RFC v2 4/6] inode: port __I_NEW " Christian Brauner
@ 2024-08-23 0:31 ` NeilBrown
2024-08-23 8:20 ` Christian Brauner
0 siblings, 1 reply; 49+ messages in thread
From: NeilBrown @ 2024-08-23 0:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, Christian Brauner, linux-fsdevel
On Thu, 22 Aug 2024, Christian Brauner wrote:
> Port the __I_NEW mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/bcachefs/fs.c | 10 ++++++----
> fs/dcache.c | 3 +--
> fs/inode.c | 18 ++++++++----------
> include/linux/writeback.h | 3 ++-
> 4 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> index 94c392abef65..c0900c0c0f8a 100644
> --- a/fs/bcachefs/fs.c
> +++ b/fs/bcachefs/fs.c
> @@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
> break;
> }
> } else if (clean_pass && this_pass_clean) {
> - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
> - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
>
> - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
I don't think you EXPORT inode_bit_waitqueue() so you cannot use it in
this module.
And maybe it would be good to not export it so that this code can get
cleaned up.
Maybe I'm missing something obvious but it seems weird.
Earlier in this file a comment tells us that bcachefs doesn't use I_NEW,
but here is bcachefs explicitly waiting for it.
If bch2_inode_insert() called unlock_new_inode() immediately *before*
adding the inode to vfs_inodes_list instead of just after, this loop
that walks vfs_inodes_list would never need to wait for I_NEW to be
cleared.
I wonder if I am missing something.
NeilBrown
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-21 15:47 ` [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
2024-08-21 19:41 ` Jeff Layton
@ 2024-08-23 0:36 ` NeilBrown
2024-08-23 2:24 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: NeilBrown @ 2024-08-23 0:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, Christian Brauner, linux-fsdevel
On Thu, 22 Aug 2024, Christian Brauner wrote:
> Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/inode.c | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index d18e1567c487..c8a5c63dc980 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -510,8 +510,7 @@ static void inode_unpin_lru_isolating(struct inode *inode)
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
> inode->i_state &= ~I_LRU_ISOLATING;
> - smp_mb();
> - wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
> + inode_wake_up_bit(inode, __I_LRU_ISOLATING);
> spin_unlock(&inode->i_lock);
> }
>
> @@ -519,13 +518,22 @@ static void inode_wait_for_lru_isolating(struct inode *inode)
> {
> lockdep_assert_held(&inode->i_lock);
> if (inode->i_state & I_LRU_ISOLATING) {
> - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> - wait_queue_head_t *wqh;
> -
> - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> - spin_unlock(&inode->i_lock);
> - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> - spin_lock(&inode->i_lock);
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
> +
> + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> + for (;;) {
> + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> + TASK_UNINTERRUPTIBLE);
> + if (inode->i_state & I_LRU_ISOLATING) {
> + spin_unlock(&inode->i_lock);
> + schedule();
> + spin_lock(&inode->i_lock);
> + continue;
> + }
> + break;
> + }
> + finish_wait(wq_head, &wqe.wq_entry);
I would really like to add
wait_var_event_locked(variable, conditon, spinlock)
so that above would be one or two lines.
#define wait_var_event_locked(var, condition, lock)			\
do {									\
	might_sleep();							\
	if (condition)							\
		break;							\
	___wait_var_event(var, condition, TASK_UNINTERRUPTIBLE,		\
			  0, 0,						\
			  spin_unlock(lock); schedule(); spin_lock(lock)); \
} while (0)
That can happen after your series lands though.
The wake_up here doesn't need a memory barrier either.
NeilBrown
> WARN_ON(inode->i_state & I_LRU_ISOLATING);
> }
> }
>
> --
> 2.43.0
>
>
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-23 0:36 ` NeilBrown
@ 2024-08-23 2:24 ` Linus Torvalds
2024-08-23 8:29 ` Christian Brauner
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2024-08-23 2:24 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, 23 Aug 2024 at 08:37, NeilBrown <neilb@suse.de> wrote:
>
> I would really like to add
>
> wait_var_event_locked(variable, condition, spinlock)
>
> so that above would be one or two lines.
We don't even have that for the regular wait_event, but we do have
wait_event_interruptible_locked
wait_event_interruptible_locked_irq
but they actually only work on the wait-queue lock, not on some
generic "use this lock".
Honestly, I like your version much better, but having two *very*
different models for what "locked" means looks wrong.
The wait-queue lock (our current "locked" event version) is a rather
important special case, though, and takes advantage of holding the
waitqueue lock by then using __add_wait_queue_entry_tail() without
locking.
"You are in a maze of twisty little functions, all alike".
Linus
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-23 0:14 ` NeilBrown
@ 2024-08-23 2:52 ` Linus Torvalds
2024-08-23 3:05 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Linus Torvalds @ 2024-08-23 2:52 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
>
> Maybe inode_wake_up_bit() should not have a barrier and the various
> callers should add whichever barrier is appropriate. That is the model
> that Linus prefers for wake_up_bit() and for consistency it should apply
> to inode_wake_up_bit() too.
I think that for the wake_up_bit() cases in particular, it's fairly
easy to add whatever "clear_and_wakeup()" helpers. After all, there's
only so much you can do to one bit, and I think merging the "act" with
the "wakeup" into one routine is not only going to help fix the memory
ordering problem, I think it's going to result in more readable code
even aside from any memory ordering issues.
The wake_up_var() infrastructure that the inode code uses is a bit
more involved. Not only can the variable be anything at all (so the
operations you can do on it are obviously largely unbounded), but the
inode hack in particular then uses one thing for the actual variable,
and another thing for the address that is used to match up waits and
wakeups.
So I suspect the inode code will have to do its thing explicitly with
the low-level helpers and deal with the memory ordering issues on its
own, adding the smp_mb() manually.
Linus
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-23 2:52 ` Linus Torvalds
@ 2024-08-23 3:05 ` Linus Torvalds
2024-08-23 3:44 ` Linus Torvalds
2024-08-23 5:01 ` NeilBrown
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2024-08-23 3:05 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, 23 Aug 2024 at 10:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.
.. btw, that doesn't mean we can't have helpers for the common cases.
They might have to be macros (so that they just work regardless of the
type), but having a
set_var_and_wake(var, value);
macro that just expands to something like (completely untested "maybe
this works" macro):
#define set_var_and_wake(var, value) do {	\
	__auto_type __set_ptr = &(var);		\
	*(__set_ptr) = (value);			\
	smp_mb();				\
	wake_up_var(__set_ptr);			\
} while (0)
doesn't sound too bad for at least some common patterns.
Looking around, we do seem to have a pattern of
smp_store_release() -> wake_up_var()
instead of a memory barrier. I don't think that actually works. The
smp_store_release() means that *earlier* accesses will be bounded by
the store operation, but *later* accesses - including very much the
"look if the wait queue is empty" check - are totally unordered by it,
and can be done before the store by the CPU.
But I haven't thought deeply about it, that was just my gut reaction
when seeing the pattern. It superficially makes sense, but I think
it's entirely wrong (it's a "smp_load_acquire()" that would order with
later accesses, but there is no "store with acquire semantics"
operation).
Linus
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-23 3:05 ` Linus Torvalds
@ 2024-08-23 3:44 ` Linus Torvalds
0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2024-08-23 3:44 UTC (permalink / raw)
To: NeilBrown
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, 23 Aug 2024 at 11:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking around, we do seem to have a pattern of
>
> smp_store_release() -> wake_up_var()
>
> instead of a memory barrier. I don't think that actually works.
Hmm. It might not work for the wakeup race, but it might be a good
idea to do the store_release anyway, just for the actual user (ie then
the *use* of the variable may start with a "smp_load_acquire()", and
the release->acquire semantics means that everything that was done
before the release is visible after the acquire).
Of course, the smp_mb() will force that ordering too, but for a true
concurrent user that doesn't actually need waking up, it might see the
newly stored var value before the smp_mb() happens on the generating
side.
End result: those code paths may want *both* the smp_store_release()
and the smp_mb(), because they are ordering different accesses.
Just goes to show that this whole thing is more complicated than just
"wait for a value".
Linus
* Re: [PATCH RFC v2 1/6] fs: add i_state helpers
2024-08-23 2:52 ` Linus Torvalds
2024-08-23 3:05 ` Linus Torvalds
@ 2024-08-23 5:01 ` NeilBrown
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
2 siblings, 0 replies; 49+ messages in thread
From: NeilBrown @ 2024-08-23 5:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, 23 Aug 2024, Linus Torvalds wrote:
> On Fri, 23 Aug 2024 at 08:14, NeilBrown <neilb@suse.de> wrote:
> >
> > Maybe inode_wake_up_bit() should not have a barrier and the various
> > callers should add whichever barrier is appropriate. That is the model
> > that Linus prefers for wake_up_bit() and for consistency it should apply
> > to inode_wake_up_bit() too.
>
> I think that for the wake_up_bit() cases in particular, it's fairly
> easy to add whatever "clear_and_wakeup()" helpers. After all, there's
> only so much you can do to one bit, and I think merging the "act" with
> the "wakeup" into one routine is not only going to help fix the memory
> ordering problem, I think it's going to result in more readable code
> even aside from any memory ordering issues.
The patterns that don't easily fit a helper include:
- skipping the wake_up when there is no hurry.
nfs_page_clear_headlock() does this if PG_CONTENDED1
isn't set. intel_recv_event() does something very similar
if STATE_FIRMWARE_LOADED isn't set.
I imagine changing these to
	if (need a wakeup)
		clear_bit_and_wakeup(...)
	else
		clear_bit()
rpc_make_runnable() has three cases and maybe each would need
a separate clear_bit in order to have a clear_bit adjacent to
the wake_up(). That might actually be a good thing, I haven't
really tried yet.
- key_reject_and_link() separates the test_and_clear from the
wake_up_bit() for reasons that aren't completely obvious to me.
__key_instantiate_and_link() does the same.
Maybe they just move the wake_up outside the mutex....
>
> The wake_up_var() infrastructure that the inode code uses is a bit
> more involved. Not only can the variable be anything at all (so the
> operations you can do on it are obviously largely unbounded), but the
> inode hack in particular then uses one thing for the actual variable,
> and another thing for the address that is used to match up waits and
> wakeups.
>
> So I suspect the inode code will have to do its thing explcitly with
> the low-level helpers and deal with the memory ordering issues on its
> own, adding the smp_mb() manually.
The problem here is that "wake_up_var()" doesn't *look* like a
low-level helper. Maybe we could replace the few remaining instances
with __wake_up_var() once the easy cases are fixed??
Thanks,
NeilBrown
* Re: [PATCH RFC v2 4/6] inode: port __I_NEW to var event
2024-08-23 0:31 ` NeilBrown
@ 2024-08-23 8:20 ` Christian Brauner
2024-08-23 11:07 ` Christian Brauner
0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 8:20 UTC (permalink / raw)
To: NeilBrown
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, Aug 23, 2024 at 10:31:55AM GMT, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > Port the __I_NEW mechanism to use the new var event mechanism.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/bcachefs/fs.c | 10 ++++++----
> > fs/dcache.c | 3 +--
> > fs/inode.c | 18 ++++++++----------
> > include/linux/writeback.h | 3 ++-
> > 4 files changed, 17 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> > index 94c392abef65..c0900c0c0f8a 100644
> > --- a/fs/bcachefs/fs.c
> > +++ b/fs/bcachefs/fs.c
> > @@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
> > break;
> > }
> > } else if (clean_pass && this_pass_clean) {
> > - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
> > - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
> > + struct wait_bit_queue_entry wqe;
> > + struct wait_queue_head *wq_head;
> >
> > - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
>
> I don't think you EXPORT inode_bit_waitqueue() so you cannot use it in
> this module.
I had already fixed that in-tree because I got a build failure on one of
my test machines.
>
> And maybe it would be good to not export it so that this code can get
> cleaned up.
Maybe but I prefer this just to be a follow-up. So that we get hung up
on ever more details.
* Re: [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-23 2:24 ` Linus Torvalds
@ 2024-08-23 8:29 ` Christian Brauner
0 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 8:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
linux-fsdevel
On Fri, Aug 23, 2024 at 10:24:57AM GMT, Linus Torvalds wrote:
> On Fri, 23 Aug 2024 at 08:37, NeilBrown <neilb@suse.de> wrote:
> >
> > I would really like to add
> >
> > wait_var_event_locked(variable, condition, spinlock)
> >
> > so that above would be one or two lines.
>
> We don't even have that for the regular wait_event, but we do have
>
> wait_event_interruptible_locked
> wait_event_interruptible_locked_irq
>
> but they actually only work on the wait-queue lock, not on some
> generic "use this lock".
>
> Honestly, I like your version much better, but having two *very*
> different models for what "locked" means looks wrong.
>
> The wait-queue lock (our current "locked" event version) is a rather
> important special case, though, and takes advantage of holding the
> waitqueue lock by then using __add_wait_queue_entry_tail() without
> locking.
>
> "You are in a maze of twisty little functions, all alike".
"You could've used a little more cowbell^wunderscores."
__fput()
____fput()
__wait_var_event()
___wait_var_event()
https://www.youtube.com/watch?v=cVsQLlk-T0s
* Re: [PATCH RFC v2 4/6] inode: port __I_NEW to var event
2024-08-23 8:20 ` Christian Brauner
@ 2024-08-23 11:07 ` Christian Brauner
0 siblings, 0 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 11:07 UTC (permalink / raw)
To: NeilBrown
Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Jeff Layton,
Jan Kara, linux-fsdevel
On Fri, Aug 23, 2024 at 10:20:31AM GMT, Christian Brauner wrote:
> On Fri, Aug 23, 2024 at 10:31:55AM GMT, NeilBrown wrote:
> > On Thu, 22 Aug 2024, Christian Brauner wrote:
> > > Port the __I_NEW mechanism to use the new var event mechanism.
> > >
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > > ---
> > > fs/bcachefs/fs.c | 10 ++++++----
> > > fs/dcache.c | 3 +--
> > > fs/inode.c | 18 ++++++++----------
> > > include/linux/writeback.h | 3 ++-
> > > 4 files changed, 17 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> > > index 94c392abef65..c0900c0c0f8a 100644
> > > --- a/fs/bcachefs/fs.c
> > > +++ b/fs/bcachefs/fs.c
> > > @@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
> > > break;
> > > }
> > > } else if (clean_pass && this_pass_clean) {
> > > - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
> > > - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
> > > + struct wait_bit_queue_entry wqe;
> > > + struct wait_queue_head *wq_head;
> > >
> > > - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > > + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
> >
> > I don't think you EXPORT inode_bit_waitqueue() so you cannot use it in
> > this module.
>
> I had already fixed that in-tree because I got a build failure on one of
> my test machines.
>
> >
> > And maybe it would be good to not export it so that this code can get
> > cleaned up.
>
> Maybe but I prefer this just to be a follow-up. So that we get hung up
*don't
> on ever more details.
* [PATCH v3 0/6] inode: turn i_state into u32
2024-08-23 2:52 ` Linus Torvalds
2024-08-23 3:05 ` Linus Torvalds
2024-08-23 5:01 ` NeilBrown
@ 2024-08-23 12:47 ` Christian Brauner
2024-08-23 12:47 ` [PATCH v3 1/6] fs: add i_state helpers Christian Brauner
` (6 more replies)
2 siblings, 7 replies; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
Hey,
This is v3. I changed to manual barriers as requested and commented
them.
---
I've recently looked for some free space in struct inode again because
of some exec kerfuffle we recently had and while my idea didn't turn
into anything I noticed that we often waste bytes when using wait bit
operations. So I set out to switch that to another mechanism that would
allow us to free up bytes. So this is an attempt to turn i_state from
an unsigned long into an u32 using the individual bytes of i_state as
addresses for the wait var event mechanism (Thanks to Linus for that idea.).
This survives LTP, xfstests on various filesystems, and will-it-scale.
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Use manual barrier.
- Link to v2: https://lore.kernel.org/r/20240821-work-i_state-v2-0-67244769f102@kernel.org
Changes in v2:
- Actually send out the correct branch.
- Link to v1: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org
---
---
base-commit: 01e603fb789c75b3a0c63bddd42a42a710da7a52
change-id: 20240820-work-i_state-4e34db39bcf8
* [PATCH v3 1/6] fs: add i_state helpers
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-05 16:01 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 2/6] fs: reorder i_state bits Christian Brauner
` (5 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
The i_state member is an unsigned long so that it can be used with the
wait bit infrastructure which expects unsigned long. This wastes 4 bytes
which we're unlikely to ever use. Switch to using the var event wait
mechanism using the address of the bit. Thanks to Linus for the address
idea.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/inode.c | 11 +++++++++++
include/linux/fs.h | 15 +++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 154f8689457f..877c64a1bf63 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -472,6 +472,17 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
inode->i_state |= I_REFERENCED;
}
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+ struct inode *inode, u32 bit)
+{
+ void *bit_address;
+
+ bit_address = inode_state_wait_address(inode, bit);
+ init_wait_var_entry(wqe, bit_address, 0);
+ return __var_waitqueue(bit_address);
+}
+EXPORT_SYMBOL(inode_bit_waitqueue);
+
/*
* Add inode to LRU if needed (inode is unused and clean).
*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23e7d46b818a..1d895b8cb801 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,21 @@ struct inode {
void *i_private; /* fs or device private pointer */
} __randomize_layout;
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+ struct inode *inode, u32 bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
+{
+ /* Caller is responsible for correct memory barriers. */
+ wake_up_var(inode_state_wait_address(inode, bit));
+}
+
struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
static inline unsigned int i_blocksize(const struct inode *node)
--
2.43.0
* [PATCH v3 2/6] fs: reorder i_state bits
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-23 12:47 ` [PATCH v3 1/6] fs: add i_state helpers Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-05 16:02 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 3/6] inode: port __I_SYNC to var event Christian Brauner
` (4 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
so that we can use the first bits to derive unique addresses from
i_state.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/fs.h | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1d895b8cb801..f257f8fad7d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2417,28 +2417,32 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* i_count.
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
+ *
+ * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
+ * upon. There's one free address left.
*/
-#define I_DIRTY_SYNC (1 << 0)
-#define I_DIRTY_DATASYNC (1 << 1)
-#define I_DIRTY_PAGES (1 << 2)
-#define __I_NEW 3
+#define __I_NEW 0
#define I_NEW (1 << __I_NEW)
-#define I_WILL_FREE (1 << 4)
-#define I_FREEING (1 << 5)
-#define I_CLEAR (1 << 6)
-#define __I_SYNC 7
+#define __I_SYNC 1
#define I_SYNC (1 << __I_SYNC)
-#define I_REFERENCED (1 << 8)
+#define __I_LRU_ISOLATING 2
+#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
+
+#define I_DIRTY_SYNC (1 << 3)
+#define I_DIRTY_DATASYNC (1 << 4)
+#define I_DIRTY_PAGES (1 << 5)
+#define I_WILL_FREE (1 << 6)
+#define I_FREEING (1 << 7)
+#define I_CLEAR (1 << 8)
+#define I_REFERENCED (1 << 9)
#define I_LINKABLE (1 << 10)
#define I_DIRTY_TIME (1 << 11)
-#define I_WB_SWITCH (1 << 13)
-#define I_OVL_INUSE (1 << 14)
-#define I_CREATING (1 << 15)
-#define I_DONTCACHE (1 << 16)
-#define I_SYNC_QUEUED (1 << 17)
-#define I_PINNING_NETFS_WB (1 << 18)
-#define __I_LRU_ISOLATING 19
-#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
+#define I_WB_SWITCH (1 << 12)
+#define I_OVL_INUSE (1 << 13)
+#define I_CREATING (1 << 14)
+#define I_DONTCACHE (1 << 15)
+#define I_SYNC_QUEUED (1 << 16)
+#define I_PINNING_NETFS_WB (1 << 17)
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
--
2.43.0
* [PATCH v3 3/6] inode: port __I_SYNC to var event
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-23 12:47 ` [PATCH v3 1/6] fs: add i_state helpers Christian Brauner
2024-08-23 12:47 ` [PATCH v3 2/6] fs: reorder i_state bits Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-05 16:02 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 4/6] inode: port __I_NEW " Christian Brauner
` (3 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
Port the __I_SYNC mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/fs-writeback.c | 45 +++++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 16 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a5006329f6f..d8bec3c1bb1f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1386,12 +1386,13 @@ static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
static void inode_sync_complete(struct inode *inode)
{
+ assert_spin_locked(&inode->i_lock);
+
inode->i_state &= ~I_SYNC;
/* If inode is clean an unused, put it into LRU now... */
inode_add_lru(inode);
- /* Waiters must see I_SYNC cleared before being woken up */
- smp_mb();
- wake_up_bit(&inode->i_state, __I_SYNC);
+ /* Called with inode->i_lock which ensures memory ordering. */
+ inode_wake_up_bit(inode, __I_SYNC);
}
static bool inode_dirtied_after(struct inode *inode, unsigned long t)
@@ -1512,17 +1513,25 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
*/
void inode_wait_for_writeback(struct inode *inode)
{
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
- wait_queue_head_t *wqh;
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
+
+ assert_spin_locked(&inode->i_lock);
+
+ if (!(inode->i_state & I_SYNC))
+ return;
- lockdep_assert_held(&inode->i_lock);
- wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+ for (;;) {
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+ /* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
+ if (!(inode->i_state & I_SYNC))
+ break;
spin_unlock(&inode->i_lock);
- __wait_on_bit(wqh, &wq, bit_wait,
- TASK_UNINTERRUPTIBLE);
+ schedule();
spin_lock(&inode->i_lock);
}
+ finish_wait(wq_head, &wqe.wq_entry);
}
/*
@@ -1533,16 +1542,20 @@ void inode_wait_for_writeback(struct inode *inode)
static void inode_sleep_on_writeback(struct inode *inode)
__releases(inode->i_lock)
{
- DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- int sleep;
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
+ bool sleep;
+
+ assert_spin_locked(&inode->i_lock);
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- sleep = inode->i_state & I_SYNC;
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+ /* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
+ sleep = !!(inode->i_state & I_SYNC);
spin_unlock(&inode->i_lock);
if (sleep)
schedule();
- finish_wait(wqh, &wait);
+ finish_wait(wq_head, &wqe.wq_entry);
}
/*
--
2.43.0
* [PATCH v3 4/6] inode: port __I_NEW to var event
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
` (2 preceding siblings ...)
2024-08-23 12:47 ` [PATCH v3 3/6] inode: port __I_SYNC to var event Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-06 13:30 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
` (2 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
Port the __I_NEW mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
I'm not fully convinced that READ_ONCE() in wait_on_inode() is
sufficient when combined with smp_mb() before wake_up_var(). Maybe we
need smp_store_release() on inode->i_state before smp_mb() and paired
with smp_load_acquire() in wait_on_inode().
---
fs/bcachefs/fs.c | 10 ++++++----
fs/dcache.c | 7 ++++++-
fs/inode.c | 32 ++++++++++++++++++++++++--------
include/linux/writeback.h | 3 ++-
4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 94c392abef65..c0900c0c0f8a 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
break;
}
} else if (clean_pass && this_pass_clean) {
- wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
- DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
- prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+ wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry,
+ TASK_UNINTERRUPTIBLE);
mutex_unlock(&c->vfs_inodes_lock);
schedule();
- finish_wait(wq, &wait.wq_entry);
+ finish_wait(wq_head, &wqe.wq_entry);
goto again;
}
}
diff --git a/fs/dcache.c b/fs/dcache.c
index 1af75fa68638..894e38cdf4d0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1908,8 +1908,13 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
__d_instantiate(entry, inode);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
+ /*
+ * Pairs with the barrier in prepare_to_wait_event() to make sure
+ * ___wait_var_event() either sees the bit cleared or
+ * waitqueue_active() check in wake_up_var() sees the waiter.
+ */
smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_instantiate_new);
diff --git a/fs/inode.c b/fs/inode.c
index 877c64a1bf63..37f20c7c2f72 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -734,7 +734,13 @@ static void evict(struct inode *inode)
* used as an indicator whether blocking on it is safe.
*/
spin_lock(&inode->i_lock);
- wake_up_bit(&inode->i_state, __I_NEW);
+ /*
+ * Pairs with the barrier in prepare_to_wait_event() to make sure
+ * ___wait_var_event() either sees the bit cleared or
+ * waitqueue_active() check in wake_up_var() sees the waiter.
+ */
+ smp_mb();
+ inode_wake_up_bit(inode, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
spin_unlock(&inode->i_lock);
@@ -1142,8 +1148,13 @@ void unlock_new_inode(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
+ /*
+ * Pairs with the barrier in prepare_to_wait_event() to make sure
+ * ___wait_var_event() either sees the bit cleared or
+ * waitqueue_active() check in wake_up_var() sees the waiter.
+ */
smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(unlock_new_inode);
@@ -1154,8 +1165,13 @@ void discard_new_inode(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW;
+ /*
+ * Pairs with the barrier in prepare_to_wait_event() to make sure
+ * ___wait_var_event() either sees the bit cleared or
+ * waitqueue_active() check in wake_up_var() sees the waiter.
+ */
smp_mb();
- wake_up_bit(&inode->i_state, __I_NEW);
+ inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
iput(inode);
}
@@ -2344,8 +2360,8 @@ EXPORT_SYMBOL(inode_needs_sync);
*/
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
{
- wait_queue_head_t *wq;
- DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
/*
* Handle racing against evict(), see that routine for more details.
@@ -2356,14 +2372,14 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock
return;
}
- wq = bit_waitqueue(&inode->i_state, __I_NEW);
- prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
if (is_inode_hash_locked)
spin_unlock(&inode_hash_lock);
schedule();
- finish_wait(wq, &wait.wq_entry);
+ finish_wait(wq_head, &wqe.wq_entry);
if (is_inode_hash_locked)
spin_lock(&inode_hash_lock);
rcu_read_lock();
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 56b85841ae4c..8f651bb0a1a5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -200,7 +200,8 @@ void inode_io_list_del(struct inode *inode);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
{
- wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
+ wait_var_event(inode_state_wait_address(inode, __I_NEW),
+ !(READ_ONCE(inode->i_state) & I_NEW));
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.43.0
* [PATCH v3 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
` (3 preceding siblings ...)
2024-08-23 12:47 ` [PATCH v3 4/6] inode: port __I_NEW " Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-09 7:35 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 6/6] inode: make i_state a u32 Christian Brauner
2024-08-23 15:06 ` [PATCH v3 0/6] inode: turn i_state into u32 Josef Bacik
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/inode.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 37f20c7c2f72..8fb8e4f9acc3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -511,24 +511,35 @@ static void inode_unpin_lru_isolating(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
inode->i_state &= ~I_LRU_ISOLATING;
- smp_mb();
- wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
+ /* Called with inode->i_lock which ensures memory ordering. */
+ inode_wake_up_bit(inode, __I_LRU_ISOLATING);
spin_unlock(&inode->i_lock);
}
static void inode_wait_for_lru_isolating(struct inode *inode)
{
+ struct wait_bit_queue_entry wqe;
+ struct wait_queue_head *wq_head;
+
lockdep_assert_held(&inode->i_lock);
- if (inode->i_state & I_LRU_ISOLATING) {
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
- wait_queue_head_t *wqh;
+ if (!(inode->i_state & I_LRU_ISOLATING))
+ return;
- wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
+ wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
+ for (;;) {
+ prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+ /*
+ * Checking I_LRU_ISOLATING with inode->i_lock guarantees
+ * memory ordering.
+ */
+ if (!(inode->i_state & I_LRU_ISOLATING))
+ break;
spin_unlock(&inode->i_lock);
- __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
+ schedule();
spin_lock(&inode->i_lock);
- WARN_ON(inode->i_state & I_LRU_ISOLATING);
}
+ finish_wait(wq_head, &wqe.wq_entry);
+ WARN_ON(inode->i_state & I_LRU_ISOLATING);
}
/**
--
2.43.0
* [PATCH v3 6/6] inode: make i_state a u32
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
` (4 preceding siblings ...)
2024-08-23 12:47 ` [PATCH v3 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
@ 2024-08-23 12:47 ` Christian Brauner
2024-09-09 7:35 ` Jan Kara
2024-08-23 15:06 ` [PATCH v3 0/6] inode: turn i_state into u32 Josef Bacik
6 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2024-08-23 12:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
Now that we use the wait var event mechanism make i_state a u32 and free
up 4 bytes. This means we currently have two 4 byte holes in struct
inode which we can pack.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/fs.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f257f8fad7d0..746ac60cef92 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -681,7 +681,8 @@ struct inode {
#endif
/* Misc */
- unsigned long i_state;
+ u32 i_state;
+ /* 32-bit hole */
struct rw_semaphore i_rwsem;
unsigned long dirtied_when; /* jiffies of first dirtying */
--
2.43.0
* Re: [PATCH v3 0/6] inode: turn i_state into u32
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
` (5 preceding siblings ...)
2024-08-23 12:47 ` [PATCH v3 6/6] inode: make i_state a u32 Christian Brauner
@ 2024-08-23 15:06 ` Josef Bacik
6 siblings, 0 replies; 49+ messages in thread
From: Josef Bacik @ 2024-08-23 15:06 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri, Aug 23, 2024 at 02:47:34PM +0200, Christian Brauner wrote:
> Hey,
>
> This is v3. I changed to manual barriers as requested and commented
> them.
>
> ---
>
> I've recently looked for some free space in struct inode again because
> of some exec kerfuffle we recently had and while my idea didn't turn
> into anything I noticed that we often waste bytes when using wait bit
> operations. So I set out to switch that to another mechanism that would
> allow us to free up bytes. So this is an attempt to turn i_state from
> an unsigned long into an u32 using the individual bytes of i_state as
> addresses for the wait var event mechanism (Thanks to Linus for that idea.).
>
> This survives LTP, xfstests on various filesystems, and will-it-scale.
>
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
* Re: [PATCH v3 1/6] fs: add i_state helpers
2024-08-23 12:47 ` [PATCH v3 1/6] fs: add i_state helpers Christian Brauner
@ 2024-09-05 16:01 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-05 16:01 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:35, Christian Brauner wrote:
> The i_state member is an unsigned long so that it can be used with the
> wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> which we're unlikely to ever use. Switch to using the var event wait
> mechanism using the address of the bit. Thanks to Linus for the address
> idea.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good. Interesting trick with the wakeup address. I'm not sure we
100% need it since e.g. __I_NEW waiters are unlikely to combine with
__I_SYNC waiters so sharing a waitqueue should not cause big issues but I
guess we'll deal with that if we run out of waiting bits with the current
scheme. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/inode.c | 11 +++++++++++
> include/linux/fs.h | 15 +++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 154f8689457f..877c64a1bf63 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -472,6 +472,17 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> inode->i_state |= I_REFERENCED;
> }
>
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit)
> +{
> + void *bit_address;
> +
> + bit_address = inode_state_wait_address(inode, bit);
> + init_wait_var_entry(wqe, bit_address, 0);
> + return __var_waitqueue(bit_address);
> +}
> +EXPORT_SYMBOL(inode_bit_waitqueue);
> +
> /*
> * Add inode to LRU if needed (inode is unused and clean).
> *
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23e7d46b818a..1d895b8cb801 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -744,6 +744,21 @@ struct inode {
> void *i_private; /* fs or device private pointer */
> } __randomize_layout;
>
> +/*
> + * Get bit address from inode->i_state to use with wait_var_event()
> + * infrastructure.
> + */
> +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> +
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> + struct inode *inode, u32 bit);
> +
> +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> +{
> + /* Caller is responsible for correct memory barriers. */
> + wake_up_var(inode_state_wait_address(inode, bit));
> +}
> +
> struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
>
> static inline unsigned int i_blocksize(const struct inode *node)
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 2/6] fs: reorder i_state bits
2024-08-23 12:47 ` [PATCH v3 2/6] fs: reorder i_state bits Christian Brauner
@ 2024-09-05 16:02 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-05 16:02 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:36, Christian Brauner wrote:
> so that we can use the first bits to derive unique addresses from
> i_state.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Sure. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 38 +++++++++++++++++++++-----------------
> 1 file changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1d895b8cb801..f257f8fad7d0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2417,28 +2417,32 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
> * i_count.
> *
> * Q: What is the difference between I_WILL_FREE and I_FREEING?
> + *
> + * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
> + * upon. There's one free address left.
> */
> -#define I_DIRTY_SYNC (1 << 0)
> -#define I_DIRTY_DATASYNC (1 << 1)
> -#define I_DIRTY_PAGES (1 << 2)
> -#define __I_NEW 3
> +#define __I_NEW 0
> #define I_NEW (1 << __I_NEW)
> -#define I_WILL_FREE (1 << 4)
> -#define I_FREEING (1 << 5)
> -#define I_CLEAR (1 << 6)
> -#define __I_SYNC 7
> +#define __I_SYNC 1
> #define I_SYNC (1 << __I_SYNC)
> -#define I_REFERENCED (1 << 8)
> +#define __I_LRU_ISOLATING 2
> +#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
> +
> +#define I_DIRTY_SYNC (1 << 3)
> +#define I_DIRTY_DATASYNC (1 << 4)
> +#define I_DIRTY_PAGES (1 << 5)
> +#define I_WILL_FREE (1 << 6)
> +#define I_FREEING (1 << 7)
> +#define I_CLEAR (1 << 8)
> +#define I_REFERENCED (1 << 9)
> #define I_LINKABLE (1 << 10)
> #define I_DIRTY_TIME (1 << 11)
> -#define I_WB_SWITCH (1 << 13)
> -#define I_OVL_INUSE (1 << 14)
> -#define I_CREATING (1 << 15)
> -#define I_DONTCACHE (1 << 16)
> -#define I_SYNC_QUEUED (1 << 17)
> -#define I_PINNING_NETFS_WB (1 << 18)
> -#define __I_LRU_ISOLATING 19
> -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING)
> +#define I_WB_SWITCH (1 << 12)
> +#define I_OVL_INUSE (1 << 13)
> +#define I_CREATING (1 << 14)
> +#define I_DONTCACHE (1 << 15)
> +#define I_SYNC_QUEUED (1 << 16)
> +#define I_PINNING_NETFS_WB (1 << 17)
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 3/6] inode: port __I_SYNC to var event
2024-08-23 12:47 ` [PATCH v3 3/6] inode: port __I_SYNC to var event Christian Brauner
@ 2024-09-05 16:02 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-05 16:02 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:37, Christian Brauner wrote:
> Port the __I_SYNC mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/fs-writeback.c | 45 +++++++++++++++++++++++++++++----------------
> 1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1a5006329f6f..d8bec3c1bb1f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1386,12 +1386,13 @@ static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
>
> static void inode_sync_complete(struct inode *inode)
> {
> + assert_spin_locked(&inode->i_lock);
> +
> inode->i_state &= ~I_SYNC;
> /* If inode is clean and unused, put it into LRU now... */
> inode_add_lru(inode);
> - /* Waiters must see I_SYNC cleared before being woken up */
> - smp_mb();
> - wake_up_bit(&inode->i_state, __I_SYNC);
> + /* Called with inode->i_lock which ensures memory ordering. */
> + inode_wake_up_bit(inode, __I_SYNC);
> }
>
> static bool inode_dirtied_after(struct inode *inode, unsigned long t)
> @@ -1512,17 +1513,25 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
> */
> void inode_wait_for_writeback(struct inode *inode)
> {
> - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
> - wait_queue_head_t *wqh;
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
> +
> + assert_spin_locked(&inode->i_lock);
> +
> + if (!(inode->i_state & I_SYNC))
> + return;
>
> - lockdep_assert_held(&inode->i_lock);
> - wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
> - while (inode->i_state & I_SYNC) {
> + wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
> + for (;;) {
> + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
> + /* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
> + if (!(inode->i_state & I_SYNC))
> + break;
> spin_unlock(&inode->i_lock);
> - __wait_on_bit(wqh, &wq, bit_wait,
> - TASK_UNINTERRUPTIBLE);
> + schedule();
> spin_lock(&inode->i_lock);
> }
> + finish_wait(wq_head, &wqe.wq_entry);
> }
>
> /*
> @@ -1533,16 +1542,20 @@ void inode_wait_for_writeback(struct inode *inode)
> static void inode_sleep_on_writeback(struct inode *inode)
> __releases(inode->i_lock)
> {
> - DEFINE_WAIT(wait);
> - wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
> - int sleep;
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
> + bool sleep;
> +
> + assert_spin_locked(&inode->i_lock);
>
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> - sleep = inode->i_state & I_SYNC;
> + wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
> + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
> + /* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
> + sleep = !!(inode->i_state & I_SYNC);
> spin_unlock(&inode->i_lock);
> if (sleep)
> schedule();
> - finish_wait(wqh, &wait);
> + finish_wait(wq_head, &wqe.wq_entry);
> }
>
> /*
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 4/6] inode: port __I_NEW to var event
2024-08-23 12:47 ` [PATCH v3 4/6] inode: port __I_NEW " Christian Brauner
@ 2024-09-06 13:30 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-06 13:30 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:38, Christian Brauner wrote:
> Port the __I_NEW mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> I'm not fully convinced that READ_ONCE() in wait_on_inode() is
> sufficient when combined with smp_mb() before wake_up_var(). Maybe we
> need smp_store_release() on inode->i_state before smp_mb() and paired
> with smp_load_acquire() in wait_on_inode().
> ---
> fs/bcachefs/fs.c | 10 ++++++----
> fs/dcache.c | 7 ++++++-
> fs/inode.c | 32 ++++++++++++++++++++++++--------
> include/linux/writeback.h | 3 ++-
> 4 files changed, 38 insertions(+), 14 deletions(-)
>
> diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> index 94c392abef65..c0900c0c0f8a 100644
> --- a/fs/bcachefs/fs.c
> +++ b/fs/bcachefs/fs.c
> @@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
> break;
> }
> } else if (clean_pass && this_pass_clean) {
> - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
> - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
>
> - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW);
> + prepare_to_wait_event(wq_head, &wqe.wq_entry,
> + TASK_UNINTERRUPTIBLE);
> mutex_unlock(&c->vfs_inodes_lock);
>
> schedule();
> - finish_wait(wq, &wait.wq_entry);
> + finish_wait(wq_head, &wqe.wq_entry);
The opencoded waits are somewhat ugly. I understand this is because of
c->vfs_inodes_lock but perhaps we could introduce wait_on_inode() with some
macro magic (to do what we need before going to sleep) to make it easier?
> diff --git a/fs/inode.c b/fs/inode.c
> index 877c64a1bf63..37f20c7c2f72 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -734,7 +734,13 @@ static void evict(struct inode *inode)
> * used as an indicator whether blocking on it is safe.
> */
> spin_lock(&inode->i_lock);
> - wake_up_bit(&inode->i_state, __I_NEW);
> + /*
> + * Pairs with the barrier in prepare_to_wait_event() to make sure
> + * ___wait_var_event() either sees the bit cleared or
> + * waitqueue_active() check in wake_up_var() sees the waiter.
> + */
> + smp_mb();
Why did you add smp_mb() here? We wake up on inode state change which has
happened quite some time ago and before several things guaranteeing a
barrier...
> + inode_wake_up_bit(inode, __I_NEW);
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> spin_unlock(&inode->i_lock);
>
...
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 56b85841ae4c..8f651bb0a1a5 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -200,7 +200,8 @@ void inode_io_list_del(struct inode *inode);
> /* writeback.h requires fs.h; it, too, is not included from here. */
> static inline void wait_on_inode(struct inode *inode)
> {
> - wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
> + wait_var_event(inode_state_wait_address(inode, __I_NEW),
> + !(READ_ONCE(inode->i_state) & I_NEW));
> }
>
> #ifdef CONFIG_CGROUP_WRITEBACK
As far as memory ordering goes, this looks good to me. But READ_ONCE() is
not really guaranteed to be enough in terms of what inode state you're
going to see when racing with i_state updates. i_state updates would have
to happen through WRITE_ONCE() for this to be safe (wrt insane compiler
deoptimizations ;)). Now that's not a new problem so I'm not sure we need
to deal with it in this patch set but it would probably deserve a comment.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 5/6] inode: port __I_LRU_ISOLATING to var event
2024-08-23 12:47 ` [PATCH v3 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
@ 2024-09-09 7:35 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-09 7:35 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:39, Christian Brauner wrote:
> Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/inode.c | 27 +++++++++++++++++++--------
> 1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 37f20c7c2f72..8fb8e4f9acc3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -511,24 +511,35 @@ static void inode_unpin_lru_isolating(struct inode *inode)
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
> inode->i_state &= ~I_LRU_ISOLATING;
> - smp_mb();
> - wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
> + /* Called with inode->i_lock which ensures memory ordering. */
> + inode_wake_up_bit(inode, __I_LRU_ISOLATING);
> spin_unlock(&inode->i_lock);
> }
>
> static void inode_wait_for_lru_isolating(struct inode *inode)
> {
> + struct wait_bit_queue_entry wqe;
> + struct wait_queue_head *wq_head;
> +
> lockdep_assert_held(&inode->i_lock);
> - if (inode->i_state & I_LRU_ISOLATING) {
> - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
> - wait_queue_head_t *wqh;
> + if (!(inode->i_state & I_LRU_ISOLATING))
> + return;
>
> - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
> + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
> + for (;;) {
> + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
> + /*
> + * Checking I_LRU_ISOLATING with inode->i_lock guarantees
> + * memory ordering.
> + */
> + if (!(inode->i_state & I_LRU_ISOLATING))
> + break;
> spin_unlock(&inode->i_lock);
> - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
> + schedule();
> spin_lock(&inode->i_lock);
> - WARN_ON(inode->i_state & I_LRU_ISOLATING);
> }
> + finish_wait(wq_head, &wqe.wq_entry);
> + WARN_ON(inode->i_state & I_LRU_ISOLATING);
> }
>
> /**
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v3 6/6] inode: make i_state a u32
2024-08-23 12:47 ` [PATCH v3 6/6] inode: make i_state a u32 Christian Brauner
@ 2024-09-09 7:35 ` Jan Kara
0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2024-09-09 7:35 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, NeilBrown, Peter Zijlstra, Ingo Molnar,
Jeff Layton, Jan Kara, linux-fsdevel
On Fri 23-08-24 14:47:40, Christian Brauner wrote:
> Now that we use the wait var event mechanism make i_state a u32 and free
> up 4 bytes. This means we currently have two 4 byte holes in struct
> inode which we can pack.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f257f8fad7d0..746ac60cef92 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -681,7 +681,8 @@ struct inode {
> #endif
>
> /* Misc */
> - unsigned long i_state;
> + u32 i_state;
> + /* 32-bit hole */
> struct rw_semaphore i_rwsem;
>
> unsigned long dirtied_when; /* jiffies of first dirtying */
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
end of thread, other threads:[~2024-09-09 7:35 UTC | newest]
Thread overview: 49+ messages
2024-08-21 15:47 [PATCH RFC v2 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 1/6] fs: add i_state helpers Christian Brauner
2024-08-21 22:12 ` NeilBrown
2024-08-21 22:47 ` Linus Torvalds
2024-08-21 23:34 ` Dave Chinner
2024-08-23 0:08 ` NeilBrown
2024-08-22 8:27 ` Christian Brauner
2024-08-22 8:37 ` Linus Torvalds
2024-08-23 0:14 ` NeilBrown
2024-08-23 2:52 ` Linus Torvalds
2024-08-23 3:05 ` Linus Torvalds
2024-08-23 3:44 ` Linus Torvalds
2024-08-23 5:01 ` NeilBrown
2024-08-23 12:47 ` [PATCH v3 0/6] inode: turn i_state into u32 Christian Brauner
2024-08-23 12:47 ` [PATCH v3 1/6] fs: add i_state helpers Christian Brauner
2024-09-05 16:01 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 2/6] fs: reorder i_state bits Christian Brauner
2024-09-05 16:02 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 3/6] inode: port __I_SYNC to var event Christian Brauner
2024-09-05 16:02 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 4/6] inode: port __I_NEW " Christian Brauner
2024-09-06 13:30 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
2024-09-09 7:35 ` Jan Kara
2024-08-23 12:47 ` [PATCH v3 6/6] inode: make i_state a u32 Christian Brauner
2024-09-09 7:35 ` Jan Kara
2024-08-23 15:06 ` [PATCH v3 0/6] inode: turn i_state into u32 Josef Bacik
2024-08-21 22:28 ` [PATCH RFC v2 1/6] fs: add i_state helpers Dave Chinner
2024-08-21 22:53 ` Linus Torvalds
2024-08-21 15:47 ` [PATCH RFC v2 2/6] fs: reorder i_state bits Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 3/6] writeback: port __I_SYNC to var event Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 4/6] inode: port __I_NEW " Christian Brauner
2024-08-23 0:31 ` NeilBrown
2024-08-23 8:20 ` Christian Brauner
2024-08-23 11:07 ` Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 5/6] inode: port __I_LRU_ISOLATING " Christian Brauner
2024-08-21 19:41 ` Jeff Layton
2024-08-22 8:53 ` Christian Brauner
2024-08-22 9:48 ` Mateusz Guzik
2024-08-22 11:10 ` Christian Brauner
2024-08-22 12:46 ` Mateusz Guzik
2024-08-23 0:36 ` NeilBrown
2024-08-23 2:24 ` Linus Torvalds
2024-08-23 8:29 ` Christian Brauner
2024-08-21 15:47 ` [PATCH RFC v2 6/6] inode: make i_state a u32 Christian Brauner
2024-08-21 21:03 ` Andreas Dilger
2024-08-22 8:31 ` Christian Brauner
2024-08-21 19:15 ` [PATCH RFC v2 0/6] inode: turn i_state into u32 Josef Bacik
2024-08-21 19:42 ` Jeff Layton