[PATCH v6 00/10] Ext4 fast commit performance patch series

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 00/10] Ext4 fast commit performance patch series
@ 2024-05-29  1:19 Harshad Shirwadkar
  2024-05-29  1:19 ` [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock Harshad Shirwadkar
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

This is the V6 of the patch series. This patch series contains fixes to a
bunch of kernel build warnings reported by Kernel Test Robot
(lkp@intel.com) and Dan Carpenter (dan.carpenter@linaro.org). Thank you!

V5 Cover Letter
---------------

This patch series supersedes the patch series "ext4: remove journal barrier
during fast commit" sent in July 2022[1].  Except following three patches
all the other patches are newly introduced in this series.

* ext4: convert i_fc_lock to spinlock
* ext4: for committing inode, make ext4_fc_track_inode wait
* ext4: drop i_fc_updates from inode fc info

This patchset does not introduce any on-disk format and / or user visible
API change. This patchset reworks fast-commit's commit path improve overall
performance of the commit path. Following optimizations have been added in
this series:

* Avoid having to lock the journal throughout the fast commit.
* Remove tracking of open handles per inode.
* Avoid issuing cache flushes when fast commits are contained within a
  single FUA writes and there is no data that needs flushing.
* Temporarily lift committing thread's priority to match that of the
  journal thread to reduce scheduling delays.

With the changes introduced in this patch series, now the commit path for
fast commits is as follows:

1. Lock the journal by calling jbd2_journal_lock_updates_no_rsv(). This
   ensures that all the exsiting handles finish and no new handles can
   start.
2. Mark all the fast commit eligible inodes as undergoing fast commit by
   setting "EXT4_STATE_FC_COMMITTING" state.
3. Unlock the journal by calling jbd2_journal_unlock_updates. This allows
   starting of new handles. If new handles try to start an update on any of
   the inodes that are being committed, ext4_fc_track_inode() will block
   until those inodes have finished the fast commit.
4. Submit data buffers of all the committing inodes.
5. Wait for [4] to complete.
6. Commit all the directory entry updates in the fast commit space.
7. Commit all the changed inodes in the fast commit space and clear
   "EXT4_STATE_FC_COMMITTING" for all the inodes.
8. Write tail tag to ensure atomicity of commits.

(The above flow has been documented in the code as well)

I verified that the patch series introduces no regressions in "log" groups
when "fast_commit" feature is enabled.

Also, we have a paper on fast commits in USENIX ATC 2024 this year which
should become available on the website[2] in a few months.

[1] https://lwn.net/Articles/902022/
[2] https://www.usenix.org/conference/atc24

Harshad Shirwadkar (10):
  ext4: convert i_fc_lock to spinlock
  ext4: for committing inode, make ext4_fc_track_inode wait
  ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr
  ext4: rework fast commit commit path
  ext4: drop i_fc_updates from inode fc info
  ext4: update code documentation
  ext4: add nolock mode to ext4_map_blocks()
  ext4: introduce selective flushing in fast commit
  ext4: temporarily elevate commit thread priority
  ext4: make fast commit ineligible on ext4_reserve_inode_write failure

 fs/ext4/ext4.h              |  29 ++--
 fs/ext4/ext4_jbd2.h         |  20 +--
 fs/ext4/fast_commit.c       | 279 ++++++++++++++++++++----------------
 fs/ext4/fast_commit.h       |   1 +
 fs/ext4/inline.c            |   3 +
 fs/ext4/inode.c             |  52 +++++--
 fs/ext4/super.c             |   7 +-
 fs/jbd2/journal.c           |   3 +-
 fs/jbd2/transaction.c       |  41 ++++--
 include/linux/jbd2.h        |   1 +
 include/trace/events/ext4.h |   7 +-
 11 files changed, 266 insertions(+), 177 deletions(-)

-- 
2.45.1.288.g0e0cd299f1-goog

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-06-21 16:19   ` Jan Kara
  2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

Convert ext4_inode_info->i_fc_lock to spinlock to avoid sleeping
in invalid contexts.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        |  7 +++++--
 fs/ext4/fast_commit.c | 24 +++++++++++-------------
 fs/ext4/super.c       |  2 +-
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 983dad8c07ec..611b8c80d99c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1062,8 +1062,11 @@ struct ext4_inode_info {
 	/* Fast commit wait queue for this inode */
 	wait_queue_head_t i_fc_wait;
 
-	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
-	struct mutex i_fc_lock;
+	/*
+	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len
+	 * and inode's EXT4_FC_STATE_COMMITTING state bit.
+	 */
+	spinlock_t i_fc_lock;
 
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 87c009e0c59a..a1aadebfcd66 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -382,7 +382,7 @@ static int ext4_fc_track_template(
 	int ret;
 
 	tid = handle->h_transaction->t_tid;
-	mutex_lock(&ei->i_fc_lock);
+	spin_lock(&ei->i_fc_lock);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -390,7 +390,7 @@ static int ext4_fc_track_template(
 		ei->i_sync_tid = tid;
 	}
 	ret = __fc_track_fn(inode, args, update);
-	mutex_unlock(&ei->i_fc_lock);
+	spin_unlock(&ei->i_fc_lock);
 
 	if (!enqueue)
 		return ret;
@@ -424,19 +424,19 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
 	struct super_block *sb = inode->i_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
-	mutex_unlock(&ei->i_fc_lock);
+	spin_unlock(&ei->i_fc_lock);
 
 	if (IS_ENCRYPTED(dir)) {
 		ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_ENCRYPTED_FILENAME,
 					NULL);
-		mutex_lock(&ei->i_fc_lock);
+		spin_lock(&ei->i_fc_lock);
 		return -EOPNOTSUPP;
 	}
 
 	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
 	if (!node) {
 		ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_NOMEM, NULL);
-		mutex_lock(&ei->i_fc_lock);
+		spin_lock(&ei->i_fc_lock);
 		return -ENOMEM;
 	}
 
@@ -448,7 +448,7 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
 		if (!node->fcd_name.name) {
 			kmem_cache_free(ext4_fc_dentry_cachep, node);
 			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_NOMEM, NULL);
-			mutex_lock(&ei->i_fc_lock);
+			spin_lock(&ei->i_fc_lock);
 			return -ENOMEM;
 		}
 		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
@@ -482,7 +482,7 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
 		list_add_tail(&node->fcd_dilist, &ei->i_fc_dilist);
 	}
 	spin_unlock(&sbi->s_fc_lock);
-	mutex_lock(&ei->i_fc_lock);
+	spin_lock(&ei->i_fc_lock);
 
 	return 0;
 }
@@ -614,10 +614,8 @@ static int __track_range(struct inode *inode, void *arg, bool update)
 	struct __track_range_args *__arg =
 		(struct __track_range_args *)arg;
 
-	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
-		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb))
 		return -ECANCELED;
-	}
 
 	oldstart = ei->i_fc_lblk_start;
 
@@ -896,15 +894,15 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 	struct ext4_extent *ex;
 	int ret;
 
-	mutex_lock(&ei->i_fc_lock);
+	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
-		mutex_unlock(&ei->i_fc_lock);
+		spin_unlock(&ei->i_fc_lock);
 		return 0;
 	}
 	old_blk_size = ei->i_fc_lblk_start;
 	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
 	ei->i_fc_lblk_len = 0;
-	mutex_unlock(&ei->i_fc_lock);
+	spin_unlock(&ei->i_fc_lock);
 
 	cur_lblk_off = old_blk_size;
 	ext4_debug("will try writing %d to %d for inode %ld\n",
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f9a4a4e89dac..77173ec91e49 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1436,7 +1436,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	atomic_set(&ei->i_unwritten, 0);
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
-	mutex_init(&ei->i_fc_lock);
+	spin_lock_init(&ei->i_fc_lock);
 	return &ei->vfs_inode;
 }
 
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
  2024-05-29  1:19 ` [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-06-21 16:33   ` Jan Kara
                     ` (2 more replies)
  2024-05-29  1:19 ` [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr Harshad Shirwadkar
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

If the inode that's being requested to track using ext4_fc_track_inode
is being committed, then wait until the inode finishes the
commit. Also, add calls to ext4_fc_track_inode at the right places.

With this patch, now calling ext4_reserve_inode_write() results in
inode being tracked for next fast commit. A subtle lock ordering
requirement with i_data_sem (which is documented in the code) requires
that ext4_fc_track_inode() be called before grabbing i_data_sem. So,
this patch also adds explicit ext4_fc_track_inode() calls in places
where i_data_sem grabbed.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++
 fs/ext4/inline.c      |  3 +++
 fs/ext4/inode.c       |  4 ++++
 3 files changed, 41 insertions(+)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index a1aadebfcd66..fa93ce399440 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -581,6 +581,8 @@ static int __track_inode(struct inode *inode, void *arg, bool update)
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	wait_queue_head_t *wq;
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -598,6 +600,38 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 	if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
 		return;
 
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) ||
+		ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE) ||
+		!list_empty(&ei->i_fc_list))
+		return;
+
+	/*
+	 * If we come here, we may sleep while waiting for the inode to
+	 * commit. We shouldn't be holding i_data_sem in write mode when we go
+	 * to sleep since the commit path needs to grab the lock while
+	 * committing the inode.
+	 */
+	WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1));
+
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+			schedule();
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	ret = ext4_fc_track_template(handle, inode, __track_inode, NULL, 1);
 	trace_ext4_fc_track_inode(handle, inode, ret);
 }
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index d5bd1e3a5d36..bd5778e680c0 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -596,6 +596,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
 			goto out;
 	}
 
+	ext4_fc_track_inode(handle, inode);
 	ret = ext4_destroy_inline_data_nolock(handle, inode);
 	if (ret)
 		goto out;
@@ -696,6 +697,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
 		goto convert;
 	}
 
+	ext4_fc_track_inode(handle, inode);
 	ret = ext4_journal_get_write_access(handle, inode->i_sb, iloc.bh,
 					    EXT4_JTR_NONE);
 	if (ret)
@@ -948,6 +950,7 @@ int ext4_da_write_inline_data_begin(struct address_space *mapping,
 		if (ret < 0)
 			goto out_release_page;
 	}
+	ext4_fc_track_inode(handle, inode);
 	ret = ext4_journal_get_write_access(handle, inode->i_sb, iloc.bh,
 					    EXT4_JTR_NONE);
 	if (ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9ccf5fe0..aa6440992a55 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -607,6 +607,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 */
 	map->m_flags &= ~EXT4_MAP_FLAGS;
 
+	ext4_fc_track_inode(handle, inode);
 	/*
 	 * New blocks allocate and/or writing to unwritten extent
 	 * will possibly result in updating i_data, so we take
@@ -3978,6 +3979,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 	if (stop_block > first_block) {
 		ext4_lblk_t hole_len = stop_block - first_block;
 
+		ext4_fc_track_inode(handle, inode);
 		down_write(&EXT4_I(inode)->i_data_sem);
 		ext4_discard_preallocations(inode);
 
@@ -4131,6 +4133,7 @@ int ext4_truncate(struct inode *inode)
 	if (err)
 		goto out_stop;
 
+	ext4_fc_track_inode(handle, inode);
 	down_write(&EXT4_I(inode)->i_data_sem);
 
 	ext4_discard_preallocations(inode);
@@ -5727,6 +5730,7 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 			brelse(iloc->bh);
 			iloc->bh = NULL;
 		}
+		ext4_fc_track_inode(handle, inode);
 	}
 	ext4_std_error(inode->i_sb, err);
 	return err;
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
  2024-05-29  1:19 ` [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock Harshad Shirwadkar
  2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-06-28 13:15   ` Jan Kara
  2024-05-29  1:19 ` [PATCH v6 04/10] ext4: rework fast commit commit path Harshad Shirwadkar
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

Mark inode dirty first and then grab i_data_sem in ext4_setattr().

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aa6440992a55..61ffbdc2fb16 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5410,12 +5410,13 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
 					inode->i_sb->s_blocksize_bits);
 
-			down_write(&EXT4_I(inode)->i_data_sem);
-			old_disksize = EXT4_I(inode)->i_disksize;
-			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
 			if (!error)
 				error = rc;
+			down_write(&EXT4_I(inode)->i_data_sem);
+			old_disksize = EXT4_I(inode)->i_disksize;
+			EXT4_I(inode)->i_disksize = attr->ia_size;
+
 			/*
 			 * We have to update i_size under i_data_sem together
 			 * with i_disksize to avoid races with writeback code
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 04/10] ext4: rework fast commit commit path
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (2 preceding siblings ...)
  2024-05-29  1:19 ` [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-06-28 13:43   ` Jan Kara
  2024-05-29  1:19 ` [PATCH v6 05/10] ext4: drop i_fc_updates from inode fc info Harshad Shirwadkar
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

This patch reworks fast commit's commit path to remove locking the
journal for the entire duration of a fast commit. Instead, we only lock
the journal while marking all the eligible inodes as "committing". This
allows handles to make progress in parallel with the fast commit.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c | 75 ++++++++++++++++++++++++++++---------------
 fs/jbd2/journal.c     |  3 +-
 fs/jbd2/transaction.c | 41 +++++++++++++++--------
 include/linux/jbd2.h  |  1 +
 4 files changed, 80 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index fa93ce399440..3aca5b20aac5 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -290,20 +290,30 @@ void ext4_fc_del(struct inode *inode)
 	if (ext4_fc_disabled(inode->i_sb))
 		return;
 
-restart:
 	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
 	if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
 		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
 		return;
 	}
 
-	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		ext4_fc_wait_committing_inode(inode);
-		goto restart;
-	}
-
-	if (!list_empty(&ei->i_fc_list))
-		list_del_init(&ei->i_fc_list);
+	/*
+	 * Since ext4_fc_del is called from ext4_evict_inode while having a
+	 * handle open, there is no need for us to wait here even if a fast
+	 * commit is going on. That is because, if this inode is being
+	 * committed, ext4_mark_inode_dirty would have waited for inode commit
+	 * operation to finish before we come here. So, by the time we come
+	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
+	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
+	 * here.
+	 *
+	 * We may come here without any handles open in the "no_delete" case of
+	 * ext4_evict_inode as well. However, if that happens, we first mark the
+	 * file system as fast commit ineligible anyway. So, even in that case,
+	 * it is okay to remove the inode from the fc list.
+	 */
+	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
+		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	list_del_init(&ei->i_fc_list);
 
 	/*
 	 * Since this inode is getting removed, let's also remove all FC
@@ -326,8 +336,6 @@ void ext4_fc_del(struct inode *inode)
 		fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
 		kfree(fc_dentry->fcd_name.name);
 	kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
-
-	return;
 }
 
 /*
@@ -999,19 +1007,6 @@ static int ext4_fc_submit_inode_data_all(journal_t *journal)
 
 	spin_lock(&sbi->s_fc_lock);
 	list_for_each_entry(ei, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
-		while (atomic_read(&ei->i_fc_updates)) {
-			DEFINE_WAIT(wait);
-
-			prepare_to_wait(&ei->i_fc_wait, &wait,
-						TASK_UNINTERRUPTIBLE);
-			if (atomic_read(&ei->i_fc_updates)) {
-				spin_unlock(&sbi->s_fc_lock);
-				schedule();
-				spin_lock(&sbi->s_fc_lock);
-			}
-			finish_wait(&ei->i_fc_wait, &wait);
-		}
 		spin_unlock(&sbi->s_fc_lock);
 		ret = jbd2_submit_inode_data(journal, ei->jinode);
 		if (ret)
@@ -1124,6 +1119,20 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	int ret = 0;
 	u32 crc = 0;
 
+	/*
+	 * Wait for all the handles of the current transaction to complete
+	 * and then lock the journal. Since this is essentially the commit
+	 * path, we don't need to wait for reserved handles.
+	 */
+	jbd2_journal_lock_updates_no_rsv(journal);
+	spin_lock(&sbi->s_fc_lock);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		ext4_set_inode_state(&iter->vfs_inode,
+				     EXT4_STATE_FC_COMMITTING);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+	jbd2_journal_unlock_updates(journal);
+
 	ret = ext4_fc_submit_inode_data_all(journal);
 	if (ret)
 		return ret;
@@ -1174,6 +1183,18 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		ret = ext4_fc_write_inode(inode, &crc);
 		if (ret)
 			goto out;
+		ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		/*
+		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+		 * visible before we send the wakeup. Pairs with implicit
+		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 */
+		smp_mb();
+#if (BITS_PER_LONG < 64)
+		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
+#else
+		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
+#endif
 		spin_lock(&sbi->s_fc_lock);
 	}
 	spin_unlock(&sbi->s_fc_lock);
@@ -1311,13 +1332,17 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 	spin_lock(&sbi->s_fc_lock);
 	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
 				 i_fc_list) {
-		list_del_init(&iter->i_fc_list);
 		ext4_clear_inode_state(&iter->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (iter->i_sync_tid <= tid)
 			ext4_fc_reset_inode(&iter->vfs_inode);
-		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
+		/*
+		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+		 * visible before we send the wakeup. Pairs with implicit
+		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 */
 		smp_mb();
+		list_del_init(&iter->i_fc_list);
 #if (BITS_PER_LONG < 64)
 		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
 #else
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b6c114c11b97..7df1720f013b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -58,6 +58,7 @@ MODULE_PARM_DESC(jbd2_debug, "Debugging level for jbd2");
 EXPORT_SYMBOL(jbd2_journal_extend);
 EXPORT_SYMBOL(jbd2_journal_stop);
 EXPORT_SYMBOL(jbd2_journal_lock_updates);
+EXPORT_SYMBOL(jbd2_journal_lock_updates_no_rsv);
 EXPORT_SYMBOL(jbd2_journal_unlock_updates);
 EXPORT_SYMBOL(jbd2_journal_get_write_access);
 EXPORT_SYMBOL(jbd2_journal_get_create_access);
@@ -742,7 +743,6 @@ int jbd2_fc_begin_commit(journal_t *journal, tid_t tid)
 	}
 	journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
 	write_unlock(&journal->j_state_lock);
-	jbd2_journal_lock_updates(journal);
 
 	return 0;
 }
@@ -754,7 +754,6 @@ EXPORT_SYMBOL(jbd2_fc_begin_commit);
  */
 static int __jbd2_fc_end_commit(journal_t *journal, tid_t tid, bool fallback)
 {
-	jbd2_journal_unlock_updates(journal);
 	if (journal->j_fc_cleanup_callback)
 		journal->j_fc_cleanup_callback(journal, 0, tid);
 	write_lock(&journal->j_state_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index cb0b8d6fc0c6..4361e5c56490 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -865,25 +865,15 @@ void jbd2_journal_wait_updates(journal_t *journal)
 	}
 }
 
-/**
- * jbd2_journal_lock_updates () - establish a transaction barrier.
- * @journal:  Journal to establish a barrier on.
- *
- * This locks out any further updates from being started, and blocks
- * until all existing updates have completed, returning only once the
- * journal is in a quiescent state with no updates running.
- *
- * The journal lock should not be held on entry.
- */
-void jbd2_journal_lock_updates(journal_t *journal)
+static void __jbd2_journal_lock_updates(journal_t *journal, bool wait_on_rsv)
 {
 	jbd2_might_wait_for_commit(journal);
 
 	write_lock(&journal->j_state_lock);
 	++journal->j_barrier_count;
 
-	/* Wait until there are no reserved handles */
-	if (atomic_read(&journal->j_reserved_credits)) {
+	if (wait_on_rsv && atomic_read(&journal->j_reserved_credits)) {
+		/* Wait until there are no reserved handles */
 		write_unlock(&journal->j_state_lock);
 		wait_event(journal->j_wait_reserved,
 			   atomic_read(&journal->j_reserved_credits) == 0);
@@ -904,6 +894,31 @@ void jbd2_journal_lock_updates(journal_t *journal)
 	mutex_lock(&journal->j_barrier);
 }
 
+/**
+ * jbd2_journal_lock_updates () - establish a transaction barrier.
+ * @journal:  Journal to establish a barrier on.
+ *
+ * This locks out any further updates from being started, and blocks
+ * until all existing updates have completed, returning only once the
+ * journal is in a quiescent state with no updates running.
+ *
+ * The journal lock should not be held on entry.
+ */
+void jbd2_journal_lock_updates(journal_t *journal)
+{
+	__jbd2_journal_lock_updates(journal, true);
+}
+
+/**
+ * jbd2_journal_lock_updates_no_rsv - same as above, but doesn't
+ * wait for reserved handles to finish.
+ * @journal: Journal to establish barrier on.
+ */
+void jbd2_journal_lock_updates_no_rsv(journal_t *journal)
+{
+	__jbd2_journal_lock_updates(journal, false);
+}
+
 /**
  * jbd2_journal_unlock_updates () - release barrier
  * @journal:  Journal to release the barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 7479f64c0939..cfac287c0ad4 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1531,6 +1531,7 @@ bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio);
 extern int	 jbd2_journal_stop(handle_t *);
 extern int	 jbd2_journal_flush(journal_t *journal, unsigned int flags);
 extern void	 jbd2_journal_lock_updates (journal_t *);
+extern void	 jbd2_journal_lock_updates_no_rsv(journal_t *journal);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
 
 void jbd2_journal_wait_updates(journal_t *);
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 05/10] ext4: drop i_fc_updates from inode fc info
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (3 preceding siblings ...)
  2024-05-29  1:19 ` [PATCH v6 04/10] ext4: rework fast commit commit path Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-05-29  1:19 ` [PATCH v6 06/10] ext4: update code documentation Harshad Shirwadkar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar, Jan Kara

The new logic introduced in this series does not require tracking number
of active handles open on an inode. So, drop it.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  5 ----
 fs/ext4/fast_commit.c | 68 -------------------------------------------
 2 files changed, 73 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 611b8c80d99c..d802040e94df 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1056,9 +1056,6 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
 
-	/* Number of ongoing updates on this inode */
-	atomic_t  i_fc_updates;
-
 	/* Fast commit wait queue for this inode */
 	wait_queue_head_t i_fc_wait;
 
@@ -2903,8 +2900,6 @@ void __ext4_fc_track_create(handle_t *handle, struct inode *inode,
 void ext4_fc_track_create(handle_t *handle, struct dentry *dentry);
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode);
 void ext4_fc_mark_ineligible(struct super_block *sb, int reason, handle_t *handle);
-void ext4_fc_start_update(struct inode *inode);
-void ext4_fc_stop_update(struct inode *inode);
 void ext4_fc_del(struct inode *inode);
 bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
 void ext4_fc_replay_cleanup(struct super_block *sb);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 3aca5b20aac5..ecbbcaf78598 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -201,32 +201,6 @@ void ext4_fc_init_inode(struct inode *inode)
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	init_waitqueue_head(&ei->i_fc_wait);
-	atomic_set(&ei->i_fc_updates, 0);
-}
-
-/* This function must be called with sbi->s_fc_lock held. */
-static void ext4_fc_wait_committing_inode(struct inode *inode)
-__releases(&EXT4_SB(inode->i_sb)->s_fc_lock)
-{
-	wait_queue_head_t *wq;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-
-#if (BITS_PER_LONG < 64)
-	DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
-			EXT4_STATE_FC_COMMITTING);
-	wq = bit_waitqueue(&ei->i_state_flags,
-				EXT4_STATE_FC_COMMITTING);
-#else
-	DEFINE_WAIT_BIT(wait, &ei->i_flags,
-			EXT4_STATE_FC_COMMITTING);
-	wq = bit_waitqueue(&ei->i_flags,
-				EXT4_STATE_FC_COMMITTING);
-#endif
-	lockdep_assert_held(&EXT4_SB(inode->i_sb)->s_fc_lock);
-	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
-	schedule();
-	finish_wait(wq, &wait.wq_entry);
 }
 
 static bool ext4_fc_disabled(struct super_block *sb)
@@ -235,48 +209,6 @@ static bool ext4_fc_disabled(struct super_block *sb)
 		(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY));
 }
 
-/*
- * Inform Ext4's fast about start of an inode update
- *
- * This function is called by the high level call VFS callbacks before
- * performing any inode update. This function blocks if there's an ongoing
- * fast commit on the inode in question.
- */
-void ext4_fc_start_update(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-
-	if (ext4_fc_disabled(inode->i_sb))
-		return;
-
-restart:
-	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
-	if (list_empty(&ei->i_fc_list))
-		goto out;
-
-	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		ext4_fc_wait_committing_inode(inode);
-		goto restart;
-	}
-out:
-	atomic_inc(&ei->i_fc_updates);
-	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
-}
-
-/*
- * Stop inode update and wake up waiting fast commits if any.
- */
-void ext4_fc_stop_update(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-
-	if (ext4_fc_disabled(inode->i_sb))
-		return;
-
-	if (atomic_dec_and_test(&ei->i_fc_updates))
-		wake_up_all(&ei->i_fc_wait);
-}
-
 /*
  * Remove inode from fast commit list. If the inode is being committed
  * we wait until inode commit is done.
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 06/10] ext4: update code documentation
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (4 preceding siblings ...)
  2024-05-29  1:19 ` [PATCH v6 05/10] ext4: drop i_fc_updates from inode fc info Harshad Shirwadkar
@ 2024-05-29  1:19 ` Harshad Shirwadkar
  2024-05-29  1:20 ` [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks() Harshad Shirwadkar
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar, Jan Kara

This patch updates code documentation to reflect the commit path changes
made in this series.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fast_commit.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index ecbbcaf78598..b81b0292aa59 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -49,14 +49,21 @@
  * that need to be committed during a fast commit in another in memory queue of
  * inodes. During the commit operation, we commit in the following order:
  *
- * [1] Lock inodes for any further data updates by setting COMMITTING state
- * [2] Submit data buffers of all the inodes
- * [3] Wait for [2] to complete
- * [4] Commit all the directory entry updates in the fast commit space
- * [5] Commit all the changed inode structures
- * [6] Write tail tag (this tag ensures the atomicity, please read the following
+ * [1] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
+ *     all the exsiting handles finish and no new handles can start.
+ * [2] Mark all the fast commit eligible inodes as undergoing fast commit
+ *     by setting "EXT4_STATE_FC_COMMITTING" state.
+ * [3] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ *     starting of new handles. If new handles try to start an update on
+ *     any of the inodes that are being committed, ext4_fc_track_inode()
+ *     will block until those inodes have finished the fast commit.
+ * [4] Submit data buffers of all the committing inodes.
+ * [5] Wait for [4] to complete.
+ * [6] Commit all the directory entry updates in the fast commit space.
+ * [7] Commit all the changed inodes in the fast commit space and clear
+ *     "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [8] Write tail tag (this tag ensures the atomicity, please read the following
  *     section for more details).
- * [7] Wait for [4], [5] and [6] to complete.
  *
  * All the inode updates must call ext4_fc_start_update() before starting an
  * update. If such an ongoing update is present, fast commit waits for it to
@@ -142,6 +149,13 @@
  * similarly. Thus, by converting a non-idempotent procedure into a series of
  * idempotent outcomes, fast commits ensured idempotence during the replay.
  *
+ * Locking
+ * -------
+ * sbi->s_fc_lock protects the fast commit inodes queue and the fast commit
+ * dentry queue. ei->i_fc_lock protects the fast commit related info in a given
+ * inode. Most of the code avoids acquiring both the locks, but if one must do
+ * that then sbi->s_fc_lock must be acquired before ei->i_fc_lock.
+ *
  * TODOs
  * -----
  *
@@ -156,13 +170,12 @@
  *    fast commit recovery even if that area is invalidated by later full
  *    commits.
  *
- * 1) Fast commit's commit path locks the entire file system during fast
- *    commit. This has significant performance penalty. Instead of that, we
- *    should use ext4_fc_start/stop_update functions to start inode level
- *    updates from ext4_journal_start/stop. Once we do that we can drop file
- *    system locking during commit path.
+ * 1) Handle more ineligible cases.
  *
- * 2) Handle more ineligible cases.
+ * 2) Change ext4_fc_commit() to lookup logical to physical mapping using extent
+ *    status tree. This would get rid of the need to call ext4_fc_track_inode()
+ *    before acquiring i_data_sem. To do that we would need to ensure that
+ *    modified extents from the extent status tree are not evicted from memory.
  */
 
 #include <trace/events/ext4.h>
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks()
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (5 preceding siblings ...)
  2024-05-29  1:19 ` [PATCH v6 06/10] ext4: update code documentation Harshad Shirwadkar
@ 2024-05-29  1:20 ` Harshad Shirwadkar
  2024-06-28 14:18   ` Jan Kara
  2024-05-29  1:20 ` [PATCH v6 08/10] ext4: introduce selective flushing in fast commit Harshad Shirwadkar
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:20 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

Add nolock flag to ext4_map_blocks() which skips grabbing
i_data_sem in ext4_map_blocks. In FC commit path, we first
mark the inode as committing and thereby prevent any mutations
on it. Thus, it should be safe to call ext4_map_blocks()
without i_data_sem in this case. This is a workaround to
the problem mentioned in RFC V4 version cover letter[1] of this
patch series which pointed out that there is in incosistency between
ext4_map_blocks() behavior when EXT4_GET_BLOCKS_CACHED_NOWAIT is
passed. This patch gets rid of the need to call ext4_map_blocks()
with EXT4_GET_BLOCKS_CACHED_NOWAIT and instead call it with
EXT4_GET_BLOCKS_NOLOCK. I verified that generic/311 which failed
in cached_nowait mode passes with nolock mode.

[1] https://lwn.net/Articles/902022/

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        |  1 +
 fs/ext4/fast_commit.c | 16 ++++++++--------
 fs/ext4/inode.c       | 14 ++++++++++++--
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d802040e94df..196c513f82dd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -720,6 +720,7 @@ enum {
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 	/* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
+#define EXT4_GET_BLOCKS_NOLOCK			0x1000
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index b81b0292aa59..0b7064f8dfa5 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -559,13 +559,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		!list_empty(&ei->i_fc_list))
 		return;
 
-	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem in write mode when we go
-	 * to sleep since the commit path needs to grab the lock while
-	 * committing the inode.
-	 */
-	WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1));
 
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
 #if (BITS_PER_LONG < 64)
@@ -898,7 +891,14 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 	while (cur_lblk_off <= new_blk_size) {
 		map.m_lblk = cur_lblk_off;
 		map.m_len = new_blk_size - cur_lblk_off + 1;
-		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * Given that this inode is being committed,
+		 * EXT4_STATE_FC_COMMITTING is already set on this inode.
+		 * Which means all the mutations on the inode are paused
+		 * until the commit operation is complete. Thus it is safe
+		 * call ext4_map_blocks() in no lock mode.
+		 */
+		ret = ext4_map_blocks(NULL, inode, &map, EXT4_GET_BLOCKS_NOLOCK);
 		if (ret < 0)
 			return -ECANCELED;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 61ffbdc2fb16..f00408017c7a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -546,7 +546,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
 	 */
-	down_read(&EXT4_I(inode)->i_data_sem);
+	if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
+		down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
 		retval = ext4_ext_map_blocks(handle, inode, map, 0);
 	} else {
@@ -573,7 +574,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
 				      map->m_pblk, status);
 	}
-	up_read((&EXT4_I(inode)->i_data_sem));
+	/*
+	 * We should never call ext4_map_blocks() in nolock mode outside
+	 * of fast commit path.
+	 */
+	WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) &&
+		!ext4_test_inode_state(inode,
+				       EXT4_STATE_FC_COMMITTING));
+	if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
+		up_read((&EXT4_I(inode)->i_data_sem));
 
 found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
@@ -614,6 +623,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * the write lock of i_data_sem, and call get_block()
 	 * with create == 1 flag.
 	 */
+	WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) != 0);
 	down_write(&EXT4_I(inode)->i_data_sem);
 
 	/*
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 08/10] ext4: introduce selective flushing in fast commit
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (6 preceding siblings ...)
  2024-05-29  1:20 ` [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks() Harshad Shirwadkar
@ 2024-05-29  1:20 ` Harshad Shirwadkar
  2024-06-28 14:33   ` Jan Kara
  2024-05-29  1:20 ` [PATCH v6 09/10] ext4: temporarily elevate commit thread priority Harshad Shirwadkar
  2024-05-29  1:20 ` [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure Harshad Shirwadkar
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:20 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

With fast commits, if the entire commit is contained within a single
block and there isn't any data that needs a flush, we can avoid sending
expensive cache flush to disk. Single block metadata only fast commits
can be written using FUA to guarantee consistency.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        | 12 ++++++++++++
 fs/ext4/ext4_jbd2.h   | 20 ++++++++++++--------
 fs/ext4/fast_commit.c | 23 ++++++++++++++++++-----
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 196c513f82dd..3721daea2890 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1744,6 +1744,13 @@ struct ext4_sb_info {
 					 */
 	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
 	unsigned int s_fc_bytes;
+
+	/*
+	 * This flag indicates whether a full flush is needed on
+	 * next fast commit.
+	 */
+	int fc_flush_required;
+
 	/*
 	 * Main fast commit lock. This lock protects accesses to the
 	 * following fields:
@@ -2905,6 +2912,11 @@ void ext4_fc_del(struct inode *inode);
 bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
 void ext4_fc_replay_cleanup(struct super_block *sb);
 int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
+static inline void ext4_fc_mark_needs_flush(struct super_block *sb)
+{
+	EXT4_SB(sb)->fc_flush_required = 1;
+}
+
 int __init ext4_fc_init_dentry_cache(void);
 void ext4_fc_destroy_dentry_cache(void);
 int ext4_fc_record_regions(struct super_block *sb, int ino,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..e3a4f5c49b6e 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -420,19 +420,23 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 static inline int ext4_jbd2_inode_add_write(handle_t *handle,
 		struct inode *inode, loff_t start_byte, loff_t length)
 {
-	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_ranged_write(handle,
-				EXT4_I(inode)->jinode, start_byte, length);
-	return 0;
+	if (!ext4_handle_valid(handle))
+		return 0;
+
+	ext4_fc_mark_needs_flush(inode->i_sb);
+	return jbd2_journal_inode_ranged_write(handle,
+			EXT4_I(inode)->jinode, start_byte, length);
 }
 
 static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
 		struct inode *inode, loff_t start_byte, loff_t length)
 {
-	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_ranged_wait(handle,
-				EXT4_I(inode)->jinode, start_byte, length);
-	return 0;
+	if (!ext4_handle_valid(handle))
+		return 0;
+
+	ext4_fc_mark_needs_flush(inode->i_sb);
+	return jbd2_journal_inode_ranged_wait(handle,
+			EXT4_I(inode)->jinode, start_byte, length);
 }
 
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 0b7064f8dfa5..35c89bee452c 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -638,11 +638,24 @@ void ext4_fc_track_range(handle_t *handle, struct inode *inode, ext4_lblk_t star
 static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
 {
 	blk_opf_t write_flags = REQ_SYNC;
-	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct buffer_head *bh = sbi->s_fc_bh;
+	int old = 1, new = 0;
+
+	if (!is_tail) {
+		/*
+		 * This commit has at least 1 non-tail block,
+		 * thus FLUSH is required.
+		 */
+		ext4_fc_mark_needs_flush(sb);
+	} else {
+		/* Use cmpxchg to ensure that no flush requrest is lost. */
+		if (cmpxchg(&sbi->fc_flush_required, old, new))
+			/* Old value was 1, so request a flush. */
+			write_flags |= REQ_PREFLUSH;
+		write_flags |= REQ_FUA;
+	}
 
-	/* Add REQ_FUA | REQ_PREFLUSH only its tail */
-	if (test_opt(sb, BARRIER) && is_tail)
-		write_flags |= REQ_FUA | REQ_PREFLUSH;
 	lock_buffer(bh);
 	set_buffer_dirty(bh);
 	set_buffer_uptodate(bh);
@@ -1090,7 +1103,7 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	 * If file system device is different from journal device, issue a cache
 	 * flush before we start writing fast commit blocks.
 	 */
-	if (journal->j_fs_dev != journal->j_dev)
+	if (sbi->fc_flush_required && journal->j_fs_dev != journal->j_dev)
 		blkdev_issue_flush(journal->j_fs_dev);
 
 	blk_start_plug(&plug);
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 09/10] ext4: temporarily elevate commit thread priority
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (7 preceding siblings ...)
  2024-05-29  1:20 ` [PATCH v6 08/10] ext4: introduce selective flushing in fast commit Harshad Shirwadkar
@ 2024-05-29  1:20 ` Harshad Shirwadkar
  2024-06-28 14:42   ` Jan Kara
  2024-05-29  1:20 ` [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure Harshad Shirwadkar
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:20 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

Unlike JBD2 based full commits, there is no dedicated journal thread
for fast commits. Thus to reduce scheduling delays between IO
submission and completion, temporarily elevate the committer thread's
priority to match the configured priority of the JBD2 journal
thread.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        |  4 +++-
 fs/ext4/fast_commit.c | 13 +++++++++++++
 fs/ext4/super.c       |  5 ++---
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3721daea2890..d52df8a85271 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2287,10 +2287,12 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
 #define EXT4_DEFM_NODELALLOC	0x0800
 
 /*
- * Default journal batch times
+ * Default journal batch times and ioprio.
  */
 #define EXT4_DEF_MIN_BATCH_TIME	0
 #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
+#define EXT4_DEF_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
+
 
 /*
  * Minimum number of groups in a flexgroup before we separate out
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 35c89bee452c..55a13d3ff681 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1205,6 +1205,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	int subtid = atomic_read(&sbi->s_fc_subtid);
 	int status = EXT4_FC_STATUS_OK, fc_bufs_before = 0;
 	ktime_t start_time, commit_time;
+	int old_ioprio, journal_ioprio;
 
 	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
 		return jbd2_complete_transaction(journal, commit_tid);
@@ -1212,6 +1213,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	trace_ext4_fc_commit_start(sb, commit_tid);
 
 	start_time = ktime_get();
+	old_ioprio = get_current_ioprio();
 
 restart_fc:
 	ret = jbd2_fc_begin_commit(journal, commit_tid);
@@ -1242,6 +1244,15 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		goto fallback;
 	}
 
+	/*
+	 * Now that we know that this thread is going to do a fast commit,
+	 * elevate the priority to match that of the journal thread.
+	 */
+	if (journal->j_task->io_context)
+		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
+	else
+		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
+	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
@@ -1256,6 +1267,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	}
 	atomic_inc(&sbi->s_fc_subtid);
 	ret = jbd2_fc_end_commit(journal);
+	set_task_ioprio(current, old_ioprio);
 	/*
 	 * weight the commit time higher than the average time so we
 	 * don't react too strongly to vast changes in the commit time
@@ -1265,6 +1277,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	return ret;
 
 fallback:
+	set_task_ioprio(current, old_ioprio);
 	ret = jbd2_fc_end_commit_fallback(journal);
 	ext4_fc_update_stats(sb, status, 0, 0, commit_tid);
 	return ret;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 77173ec91e49..18d9d2631559 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1833,7 +1833,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	{}
 };
 
-#define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
 
 #define MOPT_SET	0x0001
 #define MOPT_CLEAR	0x0002
@@ -5211,7 +5210,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 
 	/* Set defaults for the variables that will be set during parsing */
 	if (!(ctx->spec & EXT4_SPEC_JOURNAL_IOPRIO))
-		ctx->journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+		ctx->journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 
 	sbi->s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS;
 	sbi->s_sectors_written_start =
@@ -6471,7 +6470,7 @@ static int __ext4_remount(struct fs_context *fc, struct super_block *sb)
 			ctx->journal_ioprio =
 				sbi->s_journal->j_task->io_context->ioprio;
 		else
-			ctx->journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+			ctx->journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 
 	}
 
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure
  2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
                   ` (8 preceding siblings ...)
  2024-05-29  1:20 ` [PATCH v6 09/10] ext4: temporarily elevate commit thread priority Harshad Shirwadkar
@ 2024-05-29  1:20 ` Harshad Shirwadkar
  2024-06-28 14:47   ` Jan Kara
  9 siblings, 1 reply; 26+ messages in thread
From: Harshad Shirwadkar @ 2024-05-29  1:20 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, saukad, harshads, Harshad Shirwadkar

Fast commit by default makes every inode on which
ext4_reserve_inode_write() is called. Thus, if that function
fails for some reason, make the next fast commit ineligible.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c       |  1 +
 fs/ext4/fast_commit.h       |  1 +
 fs/ext4/inode.c             | 29 ++++++++++++++++++-----------
 include/trace/events/ext4.h |  7 +++++--
 4 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 55a13d3ff681..e7cac190527c 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -2291,6 +2291,7 @@ static const char * const fc_ineligible_reasons[] = {
 	[EXT4_FC_REASON_FALLOC_RANGE] = "Falloc range op",
 	[EXT4_FC_REASON_INODE_JOURNAL_DATA] = "Data journalling",
 	[EXT4_FC_REASON_ENCRYPTED_FILENAME] = "Encrypted filename",
+	[EXT4_FC_REASON_INODE_RSV_WRITE_FAIL] = "Inode reserve write failure"
 };
 
 int ext4_fc_info_show(struct seq_file *seq, void *v)
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index 2fadb2c4780c..f7f85c3dd3af 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h
@@ -97,6 +97,7 @@ enum {
 	EXT4_FC_REASON_FALLOC_RANGE,
 	EXT4_FC_REASON_INODE_JOURNAL_DATA,
 	EXT4_FC_REASON_ENCRYPTED_FILENAME,
+	EXT4_FC_REASON_INODE_RSV_WRITE_FAIL,
 	EXT4_FC_REASON_MAX
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f00408017c7a..8fd6e5637542 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5729,20 +5729,27 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 {
 	int err;
 
-	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
-		return -EIO;
+	if (unlikely(ext4_forced_shutdown(inode->i_sb))) {
+		err = -EIO;
+		goto out;
+	}
 
 	err = ext4_get_inode_loc(inode, iloc);
-	if (!err) {
-		BUFFER_TRACE(iloc->bh, "get_write_access");
-		err = ext4_journal_get_write_access(handle, inode->i_sb,
-						    iloc->bh, EXT4_JTR_NONE);
-		if (err) {
-			brelse(iloc->bh);
-			iloc->bh = NULL;
-		}
-		ext4_fc_track_inode(handle, inode);
+	if (err)
+		goto out;
+
+	BUFFER_TRACE(iloc->bh, "get_write_access");
+	err = ext4_journal_get_write_access(handle, inode->i_sb,
+						iloc->bh, EXT4_JTR_NONE);
+	if (err) {
+		brelse(iloc->bh);
+		iloc->bh = NULL;
 	}
+	ext4_fc_track_inode(handle, inode);
+out:
+	if (err)
+		ext4_fc_mark_ineligible(inode->i_sb,
+			EXT4_FC_REASON_INODE_RSV_WRITE_FAIL, handle);
 	ext4_std_error(inode->i_sb, err);
 	return err;
 }
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index a697f4b77162..597845d5c1e8 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -105,6 +105,7 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_RENAME_DIR);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_FALLOC_RANGE);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_INODE_JOURNAL_DATA);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_ENCRYPTED_FILENAME);
+TRACE_DEFINE_ENUM(EXT4_FC_REASON_INODE_RSV_WRITE_FAIL);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
 #define show_fc_reason(reason)						\
@@ -118,7 +119,8 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 		{ EXT4_FC_REASON_RENAME_DIR,	"RENAME_DIR"},		\
 		{ EXT4_FC_REASON_FALLOC_RANGE,	"FALLOC_RANGE"},	\
 		{ EXT4_FC_REASON_INODE_JOURNAL_DATA,	"INODE_JOURNAL_DATA"}, \
-		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"})
+		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"}, \
+		{ EXT4_FC_REASON_INODE_RSV_WRITE_FAIL,	"INODE_RSV_WRITE_FAIL"})
 
 TRACE_DEFINE_ENUM(CR_POWER2_ALIGNED);
 TRACE_DEFINE_ENUM(CR_GOAL_LEN_FAST);
@@ -2805,7 +2807,7 @@ TRACE_EVENT(ext4_fc_stats,
 	),
 
 	TP_printk("dev %d,%d fc ineligible reasons:\n"
-		  "%s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u"
+		  "%s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u"
 		  "num_commits:%lu, ineligible: %lu, numblks: %lu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR),
@@ -2818,6 +2820,7 @@ TRACE_EVENT(ext4_fc_stats,
 		  FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE),
 		  FC_REASON_NAME_STAT(EXT4_FC_REASON_INODE_JOURNAL_DATA),
 		  FC_REASON_NAME_STAT(EXT4_FC_REASON_ENCRYPTED_FILENAME),
+		  FC_REASON_NAME_STAT(EXT4_FC_REASON_INODE_RSV_WRITE_FAIL),
 		  __entry->fc_commits, __entry->fc_ineligible_commits,
 		  __entry->fc_numblks)
 );
-- 
2.45.1.288.g0e0cd299f1-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock
  2024-05-29  1:19 ` [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock Harshad Shirwadkar
@ 2024-06-21 16:19   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-21 16:19 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:19:54, Harshad Shirwadkar wrote:
> Convert ext4_inode_info->i_fc_lock to spinlock to avoid sleeping
> in invalid contexts.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

							Honza

> ---
>  fs/ext4/ext4.h        |  7 +++++--
>  fs/ext4/fast_commit.c | 24 +++++++++++-------------
>  fs/ext4/super.c       |  2 +-
>  3 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 983dad8c07ec..611b8c80d99c 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1062,8 +1062,11 @@ struct ext4_inode_info {
>  	/* Fast commit wait queue for this inode */
>  	wait_queue_head_t i_fc_wait;
>  
> -	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
> -	struct mutex i_fc_lock;
> +	/*
> +	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len
> +	 * and inode's EXT4_FC_STATE_COMMITTING state bit.
> +	 */
> +	spinlock_t i_fc_lock;
>  
>  	/*
>  	 * i_disksize keeps track of what the inode size is ON DISK, not
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 87c009e0c59a..a1aadebfcd66 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -382,7 +382,7 @@ static int ext4_fc_track_template(
>  	int ret;
>  
>  	tid = handle->h_transaction->t_tid;
> -	mutex_lock(&ei->i_fc_lock);
> +	spin_lock(&ei->i_fc_lock);
>  	if (tid == ei->i_sync_tid) {
>  		update = true;
>  	} else {
> @@ -390,7 +390,7 @@ static int ext4_fc_track_template(
>  		ei->i_sync_tid = tid;
>  	}
>  	ret = __fc_track_fn(inode, args, update);
> -	mutex_unlock(&ei->i_fc_lock);
> +	spin_unlock(&ei->i_fc_lock);
>  
>  	if (!enqueue)
>  		return ret;
> @@ -424,19 +424,19 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
>  	struct super_block *sb = inode->i_sb;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  
> -	mutex_unlock(&ei->i_fc_lock);
> +	spin_unlock(&ei->i_fc_lock);
>  
>  	if (IS_ENCRYPTED(dir)) {
>  		ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_ENCRYPTED_FILENAME,
>  					NULL);
> -		mutex_lock(&ei->i_fc_lock);
> +		spin_lock(&ei->i_fc_lock);
>  		return -EOPNOTSUPP;
>  	}
>  
>  	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
>  	if (!node) {
>  		ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_NOMEM, NULL);
> -		mutex_lock(&ei->i_fc_lock);
> +		spin_lock(&ei->i_fc_lock);
>  		return -ENOMEM;
>  	}
>  
> @@ -448,7 +448,7 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
>  		if (!node->fcd_name.name) {
>  			kmem_cache_free(ext4_fc_dentry_cachep, node);
>  			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_NOMEM, NULL);
> -			mutex_lock(&ei->i_fc_lock);
> +			spin_lock(&ei->i_fc_lock);
>  			return -ENOMEM;
>  		}
>  		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
> @@ -482,7 +482,7 @@ static int __track_dentry_update(struct inode *inode, void *arg, bool update)
>  		list_add_tail(&node->fcd_dilist, &ei->i_fc_dilist);
>  	}
>  	spin_unlock(&sbi->s_fc_lock);
> -	mutex_lock(&ei->i_fc_lock);
> +	spin_lock(&ei->i_fc_lock);
>  
>  	return 0;
>  }
> @@ -614,10 +614,8 @@ static int __track_range(struct inode *inode, void *arg, bool update)
>  	struct __track_range_args *__arg =
>  		(struct __track_range_args *)arg;
>  
> -	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
> -		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
> +	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb))
>  		return -ECANCELED;
> -	}
>  
>  	oldstart = ei->i_fc_lblk_start;
>  
> @@ -896,15 +894,15 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
>  	struct ext4_extent *ex;
>  	int ret;
>  
> -	mutex_lock(&ei->i_fc_lock);
> +	spin_lock(&ei->i_fc_lock);
>  	if (ei->i_fc_lblk_len == 0) {
> -		mutex_unlock(&ei->i_fc_lock);
> +		spin_unlock(&ei->i_fc_lock);
>  		return 0;
>  	}
>  	old_blk_size = ei->i_fc_lblk_start;
>  	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
>  	ei->i_fc_lblk_len = 0;
> -	mutex_unlock(&ei->i_fc_lock);
> +	spin_unlock(&ei->i_fc_lock);
>  
>  	cur_lblk_off = old_blk_size;
>  	ext4_debug("will try writing %d to %d for inode %ld\n",
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index f9a4a4e89dac..77173ec91e49 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1436,7 +1436,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	atomic_set(&ei->i_unwritten, 0);
>  	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
>  	ext4_fc_init_inode(&ei->vfs_inode);
> -	mutex_init(&ei->i_fc_lock);
> +	spin_lock_init(&ei->i_fc_lock);
>  	return &ei->vfs_inode;
>  }
>  
> -- 
> 2.45.1.288.g0e0cd299f1-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait
  2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
@ 2024-06-21 16:33   ` Jan Kara
  2024-06-28 14:45   ` Jan Kara
  2024-07-01 22:08   ` Theodore Ts'o
  2 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-21 16:33 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:19:55, Harshad Shirwadkar wrote:
> If the inode that's being requested to track using ext4_fc_track_inode
> is being committed, then wait until the inode finishes the
> commit. Also, add calls to ext4_fc_track_inode at the right places.
> 
> With this patch, now calling ext4_reserve_inode_write() results in
> inode being tracked for next fast commit. A subtle lock ordering
> requirement with i_data_sem (which is documented in the code) requires
> that ext4_fc_track_inode() be called before grabbing i_data_sem. So,
> this patch also adds explicit ext4_fc_track_inode() calls in places
> where i_data_sem grabbed.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++
>  fs/ext4/inline.c      |  3 +++
>  fs/ext4/inode.c       |  4 ++++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index a1aadebfcd66..fa93ce399440 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -581,6 +581,8 @@ static int __track_inode(struct inode *inode, void *arg, bool update)
>  
>  void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
>  {
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	wait_queue_head_t *wq;
>  	int ret;
>  
>  	if (S_ISDIR(inode->i_mode))
> @@ -598,6 +600,38 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
>  	if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
>  		return;
>  
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) ||
> +		ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE) ||
> +		!list_empty(&ei->i_fc_list))

Indentation here is wrong (we never indent conditions by the same amount as
subsequent code block). Also EXT4_MF_FC_INELIGIBLE was just tested above so
why repeat it and ext4_fc_disabled() tested the EXT4_FC_REPLAY and
JOURNAL_FAST_COMMIT flags. So list_empty() should be the only thing needed
here.

BTW a separate helper for ext4_fc_disabled() + EXT4_MF_FC_INELIGIBLE test
would be a nice cleanup as it is a pattern happening in quite a few places.

> +		return;
> +
> +	/*
> +	 * If we come here, we may sleep while waiting for the inode to
> +	 * commit. We shouldn't be holding i_data_sem in write mode when we go
> +	 * to sleep since the commit path needs to grab the lock while
> +	 * committing the inode.
> +	 */
> +	WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1));

Actually even holding it in read mode is problematic. rwsems can block
other readers from acquiring rwsem if there's some writer waiting to
acquire the lock (and that will be blocked behind us). Shortly, this should
be just lockdep_is_held().

Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr
  2024-05-29  1:19 ` [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr Harshad Shirwadkar
@ 2024-06-28 13:15   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-28 13:15 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:19:56, Harshad Shirwadkar wrote:
> Mark inode dirty first and then grab i_data_sem in ext4_setattr().
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  fs/ext4/inode.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index aa6440992a55..61ffbdc2fb16 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5410,12 +5410,13 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
>  					inode->i_sb->s_blocksize_bits);
>  
> -			down_write(&EXT4_I(inode)->i_data_sem);
> -			old_disksize = EXT4_I(inode)->i_disksize;
> -			EXT4_I(inode)->i_disksize = attr->ia_size;
>  			rc = ext4_mark_inode_dirty(handle, inode);
>  			if (!error)
>  				error = rc;
> +			down_write(&EXT4_I(inode)->i_data_sem);
> +			old_disksize = EXT4_I(inode)->i_disksize;
> +			EXT4_I(inode)->i_disksize = attr->ia_size;
> +

This is wrong. ext4_mark_inode_dirty() will copy the data to on-disk
buffer. If you set i_disksize after ext4_mark_inode_dirty(), the change
will not get properly logged in the transaction.

What you rather need is calling ext4_mark_inode_dirty() after dropping
i_data_sem which should be completely fine.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 04/10] ext4: rework fast commit commit path
  2024-05-29  1:19 ` [PATCH v6 04/10] ext4: rework fast commit commit path Harshad Shirwadkar
@ 2024-06-28 13:43   ` Jan Kara
  2024-07-13  1:38     ` harshad shirwadkar
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2024-06-28 13:43 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:19:57, Harshad Shirwadkar wrote:
> This patch reworks fast commit's commit path to remove locking the
> journal for the entire duration of a fast commit. Instead, we only lock
> the journal while marking all the eligible inodes as "committing". This
> allows handles to make progress in parallel with the fast commit.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
...
> @@ -1124,6 +1119,20 @@ static int ext4_fc_perform_commit(journal_t *journal)
>  	int ret = 0;
>  	u32 crc = 0;
>  
> +	/*
> +	 * Wait for all the handles of the current transaction to complete
> +	 * and then lock the journal. Since this is essentially the commit
> +	 * path, we don't need to wait for reserved handles.
> +	 */

Here I'd expand the comment to explain better why this is safe. Like:

	/*
	 * Wait for all the handles of the current transaction to complete
	 * and then lock the journal. We don't need to wait for reserved
	 * handles since we only need to set EXT4_STATE_FC_COMMITTING state
	 * while the journal is locked - in particular we don't depend on
	 * page writeback state so there's no risk of deadlocking reserved
	 * handles.
	 */

> +	jbd2_journal_lock_updates_no_rsv(journal);
> +	spin_lock(&sbi->s_fc_lock);
> +	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
> +		ext4_set_inode_state(&iter->vfs_inode,
> +				     EXT4_STATE_FC_COMMITTING);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +	jbd2_journal_unlock_updates(journal);
> +
>  	ret = ext4_fc_submit_inode_data_all(journal);
>  	if (ret)
>  		return ret;
> @@ -1174,6 +1183,18 @@ static int ext4_fc_perform_commit(journal_t *journal)
>  		ret = ext4_fc_write_inode(inode, &crc);
>  		if (ret)
>  			goto out;
> +		ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
> +		/*
> +		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
> +		 * visible before we send the wakeup. Pairs with implicit
> +		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
> +		 */
> +		smp_mb();
> +#if (BITS_PER_LONG < 64)
> +		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
> +#else
> +		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
> +#endif

Maybe create a helper function for clearing the EXT4_STATE_FC_COMMITTING
bit and waking up the wait queue? It's a bit subtle and used in a few
places.

> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index cb0b8d6fc0c6..4361e5c56490 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -865,25 +865,15 @@ void jbd2_journal_wait_updates(journal_t *journal)
>  	}
>  }
>  
> -/**
> - * jbd2_journal_lock_updates () - establish a transaction barrier.
> - * @journal:  Journal to establish a barrier on.
> - *
> - * This locks out any further updates from being started, and blocks
> - * until all existing updates have completed, returning only once the
> - * journal is in a quiescent state with no updates running.
> - *
> - * The journal lock should not be held on entry.
> - */
> -void jbd2_journal_lock_updates(journal_t *journal)
> +static void __jbd2_journal_lock_updates(journal_t *journal, bool wait_on_rsv)
>  {
>  	jbd2_might_wait_for_commit(journal);
>  
>  	write_lock(&journal->j_state_lock);
>  	++journal->j_barrier_count;
>  
> -	/* Wait until there are no reserved handles */
> -	if (atomic_read(&journal->j_reserved_credits)) {
> +	if (wait_on_rsv && atomic_read(&journal->j_reserved_credits)) {
> +		/* Wait until there are no reserved handles */

So it is not as simple as this. start_this_handle() ignores
journal->j_barrier_count for reserved handles so they would happily start
while you have the journal locked with jbd2_journal_lock_updates_no_rsv()
and then writeback code could mess with your fastcommit state. Or perhaps I
miss some subtlety why this is fine - but that then deserves a good
explanation in a comment or maybe a different API because currently
jbd2_journal_lock_updates_no_rsv() doesn't do what one would naively
expect.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks()
  2024-05-29  1:20 ` [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks() Harshad Shirwadkar
@ 2024-06-28 14:18   ` Jan Kara
  2024-07-13  2:01     ` harshad shirwadkar
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2024-06-28 14:18 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:20:00, Harshad Shirwadkar wrote:
> Add nolock flag to ext4_map_blocks() which skips grabbing
> i_data_sem in ext4_map_blocks. In FC commit path, we first
> mark the inode as committing and thereby prevent any mutations
> on it. Thus, it should be safe to call ext4_map_blocks()
> without i_data_sem in this case. This is a workaround to
> the problem mentioned in RFC V4 version cover letter[1] of this
> patch series which pointed out that there is in incosistency between
> ext4_map_blocks() behavior when EXT4_GET_BLOCKS_CACHED_NOWAIT is
> passed. This patch gets rid of the need to call ext4_map_blocks()
> with EXT4_GET_BLOCKS_CACHED_NOWAIT and instead call it with
> EXT4_GET_BLOCKS_NOLOCK. I verified that generic/311 which failed
> in cached_nowait mode passes with nolock mode.
> 
> [1] https://lwn.net/Articles/902022/
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

I'm sorry I forgot since last time - can you remind me why we cannot we
grab i_data_sem from ext4_fc_write_inode_data()? Because as you write
above, nobody should really be holding that lock while inode is
EXT4_STATE_FC_COMMITTING anyway...

								Honza

> ---
>  fs/ext4/ext4.h        |  1 +
>  fs/ext4/fast_commit.c | 16 ++++++++--------
>  fs/ext4/inode.c       | 14 ++++++++++++--
>  3 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index d802040e94df..196c513f82dd 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -720,6 +720,7 @@ enum {
>  #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
>  	/* Caller is in the atomic contex, find extent if it has been cached */
>  #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
> +#define EXT4_GET_BLOCKS_NOLOCK			0x1000
>  
>  /*
>   * The bit position of these flags must not overlap with any of the
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index b81b0292aa59..0b7064f8dfa5 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -559,13 +559,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
>  		!list_empty(&ei->i_fc_list))
>  		return;
>  
> -	/*
> -	 * If we come here, we may sleep while waiting for the inode to
> -	 * commit. We shouldn't be holding i_data_sem in write mode when we go
> -	 * to sleep since the commit path needs to grab the lock while
> -	 * committing the inode.
> -	 */
> -	WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1));
>  
>  	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
>  #if (BITS_PER_LONG < 64)
> @@ -898,7 +891,14 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
>  	while (cur_lblk_off <= new_blk_size) {
>  		map.m_lblk = cur_lblk_off;
>  		map.m_len = new_blk_size - cur_lblk_off + 1;
> -		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		/*
> +		 * Given that this inode is being committed,
> +		 * EXT4_STATE_FC_COMMITTING is already set on this inode.
> +		 * Which means all the mutations on the inode are paused
> +		 * until the commit operation is complete. Thus it is safe
> +		 * call ext4_map_blocks() in no lock mode.
> +		 */
> +		ret = ext4_map_blocks(NULL, inode, &map, EXT4_GET_BLOCKS_NOLOCK);
>  		if (ret < 0)
>  			return -ECANCELED;
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 61ffbdc2fb16..f00408017c7a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -546,7 +546,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  	 * Try to see if we can get the block without requesting a new
>  	 * file system block.
>  	 */
> -	down_read(&EXT4_I(inode)->i_data_sem);
> +	if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
> +		down_read(&EXT4_I(inode)->i_data_sem);
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
>  		retval = ext4_ext_map_blocks(handle, inode, map, 0);
>  	} else {
> @@ -573,7 +574,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
>  				      map->m_pblk, status);
>  	}
> -	up_read((&EXT4_I(inode)->i_data_sem));
> +	/*
> +	 * We should never call ext4_map_blocks() in nolock mode outside
> +	 * of fast commit path.
> +	 */
> +	WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) &&
> +		!ext4_test_inode_state(inode,
> +				       EXT4_STATE_FC_COMMITTING));
> +	if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
> +		up_read((&EXT4_I(inode)->i_data_sem));
>  
>  found:
>  	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> @@ -614,6 +623,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  	 * the write lock of i_data_sem, and call get_block()
>  	 * with create == 1 flag.
>  	 */
> +	WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) != 0);
>  	down_write(&EXT4_I(inode)->i_data_sem);
>  
>  	/*
> -- 
> 2.45.1.288.g0e0cd299f1-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 08/10] ext4: introduce selective flushing in fast commit
  2024-05-29  1:20 ` [PATCH v6 08/10] ext4: introduce selective flushing in fast commit Harshad Shirwadkar
@ 2024-06-28 14:33   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-28 14:33 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:20:01, Harshad Shirwadkar wrote:
> With fast commits, if the entire commit is contained within a single
> block and there isn't any data that needs a flush, we can avoid sending
> expensive cache flush to disk. Single block metadata only fast commits
> can be written using FUA to guarantee consistency.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  fs/ext4/ext4.h        | 12 ++++++++++++
>  fs/ext4/ext4_jbd2.h   | 20 ++++++++++++--------
>  fs/ext4/fast_commit.c | 23 ++++++++++++++++++-----
>  3 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 196c513f82dd..3721daea2890 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1744,6 +1744,13 @@ struct ext4_sb_info {
>  					 */
>  	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
>  	unsigned int s_fc_bytes;
> +
> +	/*
> +	 * This flag indicates whether a full flush is needed on
> +	 * next fast commit.
> +	 */
> +	int fc_flush_required;

I think this storing of fastcommit specific info in the superblock is a bad
practice and actually leads to subtle bugs (see below). I believe you
should have a dedicated structure tracking the fast commit info (and you
would actually have two of them - for the running and the committing fast
transaction).

> @@ -2905,6 +2912,11 @@ void ext4_fc_del(struct inode *inode);
>  bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
>  void ext4_fc_replay_cleanup(struct super_block *sb);
>  int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
> +static inline void ext4_fc_mark_needs_flush(struct super_block *sb)
> +{
> +	EXT4_SB(sb)->fc_flush_required = 1;
> +}
> +
>  int __init ext4_fc_init_dentry_cache(void);
>  void ext4_fc_destroy_dentry_cache(void);
>  int ext4_fc_record_regions(struct super_block *sb, int ino,
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 0c77697d5e90..e3a4f5c49b6e 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -420,19 +420,23 @@ static inline int ext4_journal_force_commit(journal_t *journal)
>  static inline int ext4_jbd2_inode_add_write(handle_t *handle,
>  		struct inode *inode, loff_t start_byte, loff_t length)
>  {
> -	if (ext4_handle_valid(handle))
> -		return jbd2_journal_inode_ranged_write(handle,
> -				EXT4_I(inode)->jinode, start_byte, length);
> -	return 0;
> +	if (!ext4_handle_valid(handle))
> +		return 0;
> +
> +	ext4_fc_mark_needs_flush(inode->i_sb);
> +	return jbd2_journal_inode_ranged_write(handle,
> +			EXT4_I(inode)->jinode, start_byte, length);
>  }

I think this handling of fc_flush_required introduces a subtle bug. While
fast commit is running, next transaction can be already running in parallel
and thus set fc_flush_required = 1. When fast commit completes, it does
cache flush and sets fc_flush_required = 0. But the data added here in
ext4_jbd2_inode_add_write() is not written out yet so the cache flush
didn't include them and the next fast commit need not flush caches causing
subtle data integrity issues after power failure.

I actually think it will be much less error prone if you track whether we
need to flush or not while writing out the fast commit to the journal. No
need to track it early when things are just being added to the transaction.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 09/10] ext4: temporarily elevate commit thread priority
  2024-05-29  1:20 ` [PATCH v6 09/10] ext4: temporarily elevate commit thread priority Harshad Shirwadkar
@ 2024-06-28 14:42   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-28 14:42 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:20:02, Harshad Shirwadkar wrote:
> Unlike JBD2 based full commits, there is no dedicated journal thread
> for fast commits. Thus to reduce scheduling delays between IO
> submission and completion, temporarily elevate the committer thread's
> priority to match the configured priority of the JBD2 journal
> thread.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This makes some sense although I'd note that io priority is getting less
and less use these days since IO is now mostly controlled through cgroup
controllers and they don't give a damn about IO priority. E.g. blk-iocost
controller uses bio_issue_as_root_blkg() (which boils down to bio->bi_opf &
(REQ_META | REQ_SWAP)) to determine whether it should avoid throttling IOs
to avoid priority inversion (exactly the case of fast-commit). So I think
properly annotating journal IO with REQ_META will bring much more tangible
benefit in common configurations that bother to control IO and then this
needn't be even needed. But I'm not really opposed either so feel free to
add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4.h        |  4 +++-
>  fs/ext4/fast_commit.c | 13 +++++++++++++
>  fs/ext4/super.c       |  5 ++---
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3721daea2890..d52df8a85271 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2287,10 +2287,12 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
>  #define EXT4_DEFM_NODELALLOC	0x0800
>  
>  /*
> - * Default journal batch times
> + * Default journal batch times and ioprio.
>   */
>  #define EXT4_DEF_MIN_BATCH_TIME	0
>  #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
> +#define EXT4_DEF_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
> +
>  
>  /*
>   * Minimum number of groups in a flexgroup before we separate out
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 35c89bee452c..55a13d3ff681 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1205,6 +1205,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  	int subtid = atomic_read(&sbi->s_fc_subtid);
>  	int status = EXT4_FC_STATUS_OK, fc_bufs_before = 0;
>  	ktime_t start_time, commit_time;
> +	int old_ioprio, journal_ioprio;
>  
>  	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
>  		return jbd2_complete_transaction(journal, commit_tid);
> @@ -1212,6 +1213,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  	trace_ext4_fc_commit_start(sb, commit_tid);
>  
>  	start_time = ktime_get();
> +	old_ioprio = get_current_ioprio();
>  
>  restart_fc:
>  	ret = jbd2_fc_begin_commit(journal, commit_tid);
> @@ -1242,6 +1244,15 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  		goto fallback;
>  	}
>  
> +	/*
> +	 * Now that we know that this thread is going to do a fast commit,
> +	 * elevate the priority to match that of the journal thread.
> +	 */
> +	if (journal->j_task->io_context)
> +		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
> +	else
> +		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
> +	set_task_ioprio(current, journal_ioprio);
>  	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
>  	ret = ext4_fc_perform_commit(journal);
>  	if (ret < 0) {
> @@ -1256,6 +1267,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  	}
>  	atomic_inc(&sbi->s_fc_subtid);
>  	ret = jbd2_fc_end_commit(journal);
> +	set_task_ioprio(current, old_ioprio);
>  	/*
>  	 * weight the commit time higher than the average time so we
>  	 * don't react too strongly to vast changes in the commit time
> @@ -1265,6 +1277,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  	return ret;
>  
>  fallback:
> +	set_task_ioprio(current, old_ioprio);
>  	ret = jbd2_fc_end_commit_fallback(journal);
>  	ext4_fc_update_stats(sb, status, 0, 0, commit_tid);
>  	return ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 77173ec91e49..18d9d2631559 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1833,7 +1833,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	{}
>  };
>  
> -#define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
>  
>  #define MOPT_SET	0x0001
>  #define MOPT_CLEAR	0x0002
> @@ -5211,7 +5210,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  
>  	/* Set defaults for the variables that will be set during parsing */
>  	if (!(ctx->spec & EXT4_SPEC_JOURNAL_IOPRIO))
> -		ctx->journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> +		ctx->journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
>  
>  	sbi->s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS;
>  	sbi->s_sectors_written_start =
> @@ -6471,7 +6470,7 @@ static int __ext4_remount(struct fs_context *fc, struct super_block *sb)
>  			ctx->journal_ioprio =
>  				sbi->s_journal->j_task->io_context->ioprio;
>  		else
> -			ctx->journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> +			ctx->journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
>  
>  	}
>  
> -- 
> 2.45.1.288.g0e0cd299f1-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait
  2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
  2024-06-21 16:33   ` Jan Kara
@ 2024-06-28 14:45   ` Jan Kara
  2024-07-01 22:08   ` Theodore Ts'o
  2 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-28 14:45 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:19:55, Harshad Shirwadkar wrote:
> If the inode that's being requested to track using ext4_fc_track_inode
> is being committed, then wait until the inode finishes the
> commit. Also, add calls to ext4_fc_track_inode at the right places.
> 
> With this patch, now calling ext4_reserve_inode_write() results in
> inode being tracked for next fast commit. A subtle lock ordering
> requirement with i_data_sem (which is documented in the code) requires
> that ext4_fc_track_inode() be called before grabbing i_data_sem. So,
> this patch also adds explicit ext4_fc_track_inode() calls in places
> where i_data_sem grabbed.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

One more thing I've noticed:

> @@ -5727,6 +5730,7 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
>  			brelse(iloc->bh);
>  			iloc->bh = NULL;
>  		}
> +		ext4_fc_track_inode(handle, inode);
>  	}
>  	ext4_std_error(inode->i_sb, err);
>  	return err;

Calling ext4_fc_track_inode() when ext4_get_write_access() failed is
pointless (inode isn't going to be written) and confusing. We should do
that only in the success case.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure
  2024-05-29  1:20 ` [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure Harshad Shirwadkar
@ 2024-06-28 14:47   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-06-28 14:47 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, saukad, harshads

On Wed 29-05-24 01:20:03, Harshad Shirwadkar wrote:
> Fast commit by default makes every inode on which
> ext4_reserve_inode_write() is called. Thus, if that function
> fails for some reason, make the next fast commit ineligible.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

Yeah, makes sense. The hunk in ext4_reserve_inode_write() will need redoing
once you fix the problem I've pointed out in patch 2 but otherwise the
patch looks good.

								Honza

> ---
>  fs/ext4/fast_commit.c       |  1 +
>  fs/ext4/fast_commit.h       |  1 +
>  fs/ext4/inode.c             | 29 ++++++++++++++++++-----------
>  include/trace/events/ext4.h |  7 +++++--
>  4 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 55a13d3ff681..e7cac190527c 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -2291,6 +2291,7 @@ static const char * const fc_ineligible_reasons[] = {
>  	[EXT4_FC_REASON_FALLOC_RANGE] = "Falloc range op",
>  	[EXT4_FC_REASON_INODE_JOURNAL_DATA] = "Data journalling",
>  	[EXT4_FC_REASON_ENCRYPTED_FILENAME] = "Encrypted filename",
> +	[EXT4_FC_REASON_INODE_RSV_WRITE_FAIL] = "Inode reserve write failure"
>  };
>  
>  int ext4_fc_info_show(struct seq_file *seq, void *v)
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> index 2fadb2c4780c..f7f85c3dd3af 100644
> --- a/fs/ext4/fast_commit.h
> +++ b/fs/ext4/fast_commit.h
> @@ -97,6 +97,7 @@ enum {
>  	EXT4_FC_REASON_FALLOC_RANGE,
>  	EXT4_FC_REASON_INODE_JOURNAL_DATA,
>  	EXT4_FC_REASON_ENCRYPTED_FILENAME,
> +	EXT4_FC_REASON_INODE_RSV_WRITE_FAIL,
>  	EXT4_FC_REASON_MAX
>  };
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f00408017c7a..8fd6e5637542 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5729,20 +5729,27 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
>  {
>  	int err;
>  
> -	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
> -		return -EIO;
> +	if (unlikely(ext4_forced_shutdown(inode->i_sb))) {
> +		err = -EIO;
> +		goto out;
> +	}
>  
>  	err = ext4_get_inode_loc(inode, iloc);
> -	if (!err) {
> -		BUFFER_TRACE(iloc->bh, "get_write_access");
> -		err = ext4_journal_get_write_access(handle, inode->i_sb,
> -						    iloc->bh, EXT4_JTR_NONE);
> -		if (err) {
> -			brelse(iloc->bh);
> -			iloc->bh = NULL;
> -		}
> -		ext4_fc_track_inode(handle, inode);
> +	if (err)
> +		goto out;
> +
> +	BUFFER_TRACE(iloc->bh, "get_write_access");
> +	err = ext4_journal_get_write_access(handle, inode->i_sb,
> +						iloc->bh, EXT4_JTR_NONE);
> +	if (err) {
> +		brelse(iloc->bh);
> +		iloc->bh = NULL;
>  	}
> +	ext4_fc_track_inode(handle, inode);
> +out:
> +	if (err)
> +		ext4_fc_mark_ineligible(inode->i_sb,
> +			EXT4_FC_REASON_INODE_RSV_WRITE_FAIL, handle);
>  	ext4_std_error(inode->i_sb, err);
>  	return err;
>  }
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index a697f4b77162..597845d5c1e8 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -105,6 +105,7 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_RENAME_DIR);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_FALLOC_RANGE);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_INODE_JOURNAL_DATA);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_ENCRYPTED_FILENAME);
> +TRACE_DEFINE_ENUM(EXT4_FC_REASON_INODE_RSV_WRITE_FAIL);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
>  
>  #define show_fc_reason(reason)						\
> @@ -118,7 +119,8 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
>  		{ EXT4_FC_REASON_RENAME_DIR,	"RENAME_DIR"},		\
>  		{ EXT4_FC_REASON_FALLOC_RANGE,	"FALLOC_RANGE"},	\
>  		{ EXT4_FC_REASON_INODE_JOURNAL_DATA,	"INODE_JOURNAL_DATA"}, \
> -		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"})
> +		{ EXT4_FC_REASON_ENCRYPTED_FILENAME,	"ENCRYPTED_FILENAME"}, \
> +		{ EXT4_FC_REASON_INODE_RSV_WRITE_FAIL,	"INODE_RSV_WRITE_FAIL"})
>  
>  TRACE_DEFINE_ENUM(CR_POWER2_ALIGNED);
>  TRACE_DEFINE_ENUM(CR_GOAL_LEN_FAST);
> @@ -2805,7 +2807,7 @@ TRACE_EVENT(ext4_fc_stats,
>  	),
>  
>  	TP_printk("dev %d,%d fc ineligible reasons:\n"
> -		  "%s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u"
> +		  "%s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u, %s:%u"
>  		  "num_commits:%lu, ineligible: %lu, numblks: %lu",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR),
> @@ -2818,6 +2820,7 @@ TRACE_EVENT(ext4_fc_stats,
>  		  FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE),
>  		  FC_REASON_NAME_STAT(EXT4_FC_REASON_INODE_JOURNAL_DATA),
>  		  FC_REASON_NAME_STAT(EXT4_FC_REASON_ENCRYPTED_FILENAME),
> +		  FC_REASON_NAME_STAT(EXT4_FC_REASON_INODE_RSV_WRITE_FAIL),
>  		  __entry->fc_commits, __entry->fc_ineligible_commits,
>  		  __entry->fc_numblks)
>  );
> -- 
> 2.45.1.288.g0e0cd299f1-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait
  2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
  2024-06-21 16:33   ` Jan Kara
  2024-06-28 14:45   ` Jan Kara
@ 2024-07-01 22:08   ` Theodore Ts'o
  2024-07-12 17:09     ` harshad shirwadkar
  2 siblings, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2024-07-01 22:08 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, saukad, harshads

On Wed, May 29, 2024 at 01:19:55AM +0000, Harshad Shirwadkar wrote:
> If the inode that's being requested to track using ext4_fc_track_inode
> is being committed, then wait until the inode finishes the
> commit. Also, add calls to ext4_fc_track_inode at the right places.
> 
> With this patch, now calling ext4_reserve_inode_write() results in
> inode being tracked for next fast commit. A subtle lock ordering
> requirement with i_data_sem (which is documented in the code) requires
> that ext4_fc_track_inode() be called before grabbing i_data_sem. So,
> this patch also adds explicit ext4_fc_track_inode() calls in places
> where i_data_sem grabbed.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

I tried applying this patchset to both the current ext4/dev branch as
well as on to 6.10-rc5, and generic/241 is triggering large series of
WARNINGS followed by a BUG (or in some cases, a soft lockup).  A
bisection leads me to this patch.

The WARN stack trace:

[    4.061189] ------------[ cut here ]------------
[    4.061848] WARNING: CPU: 1 PID: 2627 at fs/ext4/fast_commit.c:259 ext4_fc_del+0x7d/0x190
[    4.062919] CPU: 1 PID: 2627 Comm: dbench Not tainted 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350
[    4.064070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    4.065077] RIP: 0010:ext4_fc_del+0x7d/0x190
[    4.065404] Code: 0f 84 f0 00 00 00 48 8b 83 48 ff ff ff 48 0f ba e0 2a 73 18 48 8b 43 28 48 8b 80 90 03 00 00 48 8b 80 80 00 00 00 a8 02 75 02 <0f> 0b 48 89 ef e8 09 ad 68 00 84 c0 74 0f 48 8b 53 98 48 8b 43 a0
[    4.066190] RSP: 0018:ffffc90003707d98 EFLAGS: 00010246
[    4.066415] RAX: 0000000000000001 RBX: ffff888013de5c08 RCX: 0000000000000000
[    4.066718] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff88800a22c7f0
[    4.067019] RBP: ffff888013de5ba0 R08: ffffffff827fc6fe R09: ffff888008bed000
[    4.067323] R10: 0000000000000008 R11: 000000000000001e R12: ffff88800a22c7f0
[    4.067629] R13: ffff88800a22c000 R14: ffff888013de5b90 R15: ffff88800ac0c000
[    4.067935] FS:  00007fec79a4e740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[    4.068281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.068530] CR2: 000055d2f34b87e8 CR3: 000000000f492003 CR4: 0000000000770ef0
[    4.068834] PKRU: 55555554
[    4.068952] Call Trace:
[    4.069061]  <TASK>
[    4.069158]  ? __warn+0x7b/0x120
[    4.069338]  ? ext4_fc_del+0x7d/0x190
[    4.069497]  ? report_bug+0x174/0x1c0
[    4.069655]  ? handle_bug+0x3a/0x70
[    4.069809]  ? exc_invalid_op+0x17/0x70
[    4.069975]  ? asm_exc_invalid_op+0x1a/0x20
[    4.070156]  ? ext4_fc_del+0x7d/0x190
[    4.070309]  ? ext4_fc_del+0x44/0x190
[    4.070468]  ext4_clear_inode+0x12/0xb0
[    4.070636]  ext4_free_inode+0x86/0x5a0
[    4.070802]  ext4_evict_inode+0x457/0x6b0
[    4.070976]  evict+0xcd/0x1d0
[    4.071114]  do_unlinkat+0x2de/0x330
[    4.071271]  __x64_sys_unlink+0x23/0x30
[    4.071436]  do_syscall_64+0x4b/0x110
[    4.071596]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    4.071812] RIP: 0033:0x7fec79b4aa07
[    4.071969] Code: f0 ff ff 73 01 c3 48 8b 0d f6 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 83 0d 00 f7 d8 64 89 01 48
[    4.072760] RSP: 002b:00007ffed55de918 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[    4.073081] RAX: ffffffffffffffda RBX: 00007ffed55dedb0 RCX: 00007fec79b4aa07
[    4.073429] RDX: 0000000000000000 RSI: 00007ffed55dedb0 RDI: 00007ffed55dedb0
[    4.073733] RBP: 000055d2f34a37f0 R08: 0fffffffffffffff R09: 0000000000000000
[    4.074033] R10: 0000000000000000 R11: 0000000000000206 R12: 000055d2f34a35d0
[    4.074338] R13: 00007fec79c35000 R14: 0000000000000004 R15: 00007fec79c35050
[    4.074650]  </TASK>
[    4.074747] ---[ end trace 0000000000000000 ]---

And here's the BUG:

[    5.121989] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[    5.122281] #PF: supervisor read access in kernel mode
[    5.122500] #PF: error_code(0x0000) - not-present page
[    5.122717] PGD 0 P4D 0 
[    5.122828] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    5.123036] CPU: 0 PID: 2629 Comm: dbench Tainted: G        W          6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350
[    5.123470] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    5.123857] RIP: 0010:ext4_fc_perform_commit+0x303/0x4b0
[    5.124082] Code: fd 48 2d a0 00 00 00 48 39 d3 75 af e9 ac fe ff ff 49 8b 4d 58 49 8d 45 58 48 39 c1 0f 84 9f 01 00 00 49 8b 45 58 49 63 4d 08 <48> 39 88 b8 00 00 00 4c 8d 78 78 0f 85 7f 01 00 00 48 89 ef e8 c4
[    5.124855] RSP: 0018:ffffc90003727df8 EFLAGS: 00010246
[    5.125072] RAX: 0000000000000000 RBX: ffff8880089e5f08 RCX: 00000000089e5f08
[    5.125416] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88800a22c7f0
[    5.125712] RBP: ffff88800a22c7f0 R08: ffff88807dc2fbc0 R09: 0000000000000000
[    5.126009] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800a22c7c8
[    5.126310] R13: ffff8880089e5f08 R14: ffff88800a22c7a8 R15: ffff88800a22c708
[    5.126609] FS:  00007fec79a4e740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[    5.126943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.127188] CR2: 00000000000000b8 CR3: 000000000f48e002 CR4: 0000000000770ef0
[    5.127483] PKRU: 55555554
[    5.127598] Call Trace:
[    5.127705]  <TASK>
[    5.127838]  ? __die+0x23/0x60
[    5.127973]  ? page_fault_oops+0xa3/0x160
[    5.128143]  ? exc_page_fault+0x6a/0x160
[    5.128351]  ? asm_exc_page_fault+0x26/0x30
[    5.128530]  ? ext4_fc_perform_commit+0x303/0x4b0
[    5.128728]  ? ext4_fc_perform_commit+0x36b/0x4b0
[    5.128948]  ext4_fc_commit+0x17f/0x300
[    5.129116]  ext4_sync_file+0x1ce/0x380
[    5.129310]  __x64_sys_fsync+0x3b/0x70
[    5.129470]  do_syscall_64+0x4b/0x110
[    5.129627]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    5.129840] RIP: 0033:0x7fec79b4fb10
[    5.129995] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d d1 ba 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
[    5.130764] RSP: 002b:00007ffed55de948 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
[    5.131079] RAX: ffffffffffffffda RBX: 00007fec79c35450 RCX: 00007fec79b4fb10
[    5.131381] RDX: 0000000000002b6f RSI: 0000000000002b6f RDI: 0000000000000005
[    5.131681] RBP: 000055d2f34a3660 R08: 1999999999999999 R09: 0000000000000000
[    5.131981] R10: 00007fec79bcdac0 R11: 0000000000000202 R12: 000055d2f34a35d0
[    5.132280] R13: 00007fec79c35450 R14: 0000000000000003 R15: 00007fec79c354a0
[    5.132575]  </TASK>
[    5.132670] CR2: 00000000000000b8
[    5.132812] ---[ end trace 0000000000000000 ]---

Harshad, can you take a look?   Thanks!

						- Ted


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait
  2024-07-01 22:08   ` Theodore Ts'o
@ 2024-07-12 17:09     ` harshad shirwadkar
  0 siblings, 0 replies; 26+ messages in thread
From: harshad shirwadkar @ 2024-07-12 17:09 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, saukad, harshads

Thank you for the review, Jan and the bug report Ted. I think I have
found the issue here.

In fast commit code, we are using s_fc_lock to protect additions and
deletions to lists s_fc_q and s_fc_dentry_q. However, since s_fc_lock
is a spin_lock, we have to give it up in many places where we do
things like memory allocations / deletions, since those cannot be
performed in an atomic context. I am realizing that unlocking that
lock in the middle of the list traversal to allow it to perform these
non-atomic tasks is opening a door to a lot of subtle concurrency
bugs. Unlocking this lock in the middle of the traversal leaves the
door open to ext4_fc_del, which may just remove an entry from the
list, and leave the initial traversal in a broken state. So as an
immediate remedy, I am thinking that we can convert s_fc_lock to mutex
so that the commit code doesn't have to leave the lock in the middle
of the loop.

In the long run, we need to revisit the whole staging queue / main
queue design. I think we can just simplify that such that there is
really only one queue. Commit code just copies the queue in a local
variable and initializes sbi->s_fc_q to empty. That would get rid of
all the messy "if commit ongoing insert in staging otherwise insert in
main" conditions and also simplify the overall code.

I'll make the change to convert to mutex and handle all other Jan's
comments (thanks Jan for the detailed feedback on other patches) to
submit V7 of this patch series.

Thank you,
Harshad



On Mon, Jul 1, 2024 at 3:08 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, May 29, 2024 at 01:19:55AM +0000, Harshad Shirwadkar wrote:
> > If the inode that's being requested to track using ext4_fc_track_inode
> > is being committed, then wait until the inode finishes the
> > commit. Also, add calls to ext4_fc_track_inode at the right places.
> >
> > With this patch, now calling ext4_reserve_inode_write() results in
> > inode being tracked for next fast commit. A subtle lock ordering
> > requirement with i_data_sem (which is documented in the code) requires
> > that ext4_fc_track_inode() be called before grabbing i_data_sem. So,
> > this patch also adds explicit ext4_fc_track_inode() calls in places
> > where i_data_sem grabbed.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
>
> I tried applying this patchset to both the current ext4/dev branch as
> well as on to 6.10-rc5, and generic/241 is triggering large series of
> WARNINGS followed by a BUG (or in some cases, a soft lockup).  A
> bisection leads me to this patch.
>
> The WARN stack trace:
>
> [    4.061189] ------------[ cut here ]------------
> [    4.061848] WARNING: CPU: 1 PID: 2627 at fs/ext4/fast_commit.c:259 ext4_fc_del+0x7d/0x190
> [    4.062919] CPU: 1 PID: 2627 Comm: dbench Not tainted 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350
> [    4.064070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [    4.065077] RIP: 0010:ext4_fc_del+0x7d/0x190
> [    4.065404] Code: 0f 84 f0 00 00 00 48 8b 83 48 ff ff ff 48 0f ba e0 2a 73 18 48 8b 43 28 48 8b 80 90 03 00 00 48 8b 80 80 00 00 00 a8 02 75 02 <0f> 0b 48 89 ef e8 09 ad 68 00 84 c0 74 0f 48 8b 53 98 48 8b 43 a0
> [    4.066190] RSP: 0018:ffffc90003707d98 EFLAGS: 00010246
> [    4.066415] RAX: 0000000000000001 RBX: ffff888013de5c08 RCX: 0000000000000000
> [    4.066718] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff88800a22c7f0
> [    4.067019] RBP: ffff888013de5ba0 R08: ffffffff827fc6fe R09: ffff888008bed000
> [    4.067323] R10: 0000000000000008 R11: 000000000000001e R12: ffff88800a22c7f0
> [    4.067629] R13: ffff88800a22c000 R14: ffff888013de5b90 R15: ffff88800ac0c000
> [    4.067935] FS:  00007fec79a4e740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
> [    4.068281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    4.068530] CR2: 000055d2f34b87e8 CR3: 000000000f492003 CR4: 0000000000770ef0
> [    4.068834] PKRU: 55555554
> [    4.068952] Call Trace:
> [    4.069061]  <TASK>
> [    4.069158]  ? __warn+0x7b/0x120
> [    4.069338]  ? ext4_fc_del+0x7d/0x190
> [    4.069497]  ? report_bug+0x174/0x1c0
> [    4.069655]  ? handle_bug+0x3a/0x70
> [    4.069809]  ? exc_invalid_op+0x17/0x70
> [    4.069975]  ? asm_exc_invalid_op+0x1a/0x20
> [    4.070156]  ? ext4_fc_del+0x7d/0x190
> [    4.070309]  ? ext4_fc_del+0x44/0x190
> [    4.070468]  ext4_clear_inode+0x12/0xb0
> [    4.070636]  ext4_free_inode+0x86/0x5a0
> [    4.070802]  ext4_evict_inode+0x457/0x6b0
> [    4.070976]  evict+0xcd/0x1d0
> [    4.071114]  do_unlinkat+0x2de/0x330
> [    4.071271]  __x64_sys_unlink+0x23/0x30
> [    4.071436]  do_syscall_64+0x4b/0x110
> [    4.071596]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [    4.071812] RIP: 0033:0x7fec79b4aa07
> [    4.071969] Code: f0 ff ff 73 01 c3 48 8b 0d f6 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 83 0d 00 f7 d8 64 89 01 48
> [    4.072760] RSP: 002b:00007ffed55de918 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
> [    4.073081] RAX: ffffffffffffffda RBX: 00007ffed55dedb0 RCX: 00007fec79b4aa07
> [    4.073429] RDX: 0000000000000000 RSI: 00007ffed55dedb0 RDI: 00007ffed55dedb0
> [    4.073733] RBP: 000055d2f34a37f0 R08: 0fffffffffffffff R09: 0000000000000000
> [    4.074033] R10: 0000000000000000 R11: 0000000000000206 R12: 000055d2f34a35d0
> [    4.074338] R13: 00007fec79c35000 R14: 0000000000000004 R15: 00007fec79c35050
> [    4.074650]  </TASK>
> [    4.074747] ---[ end trace 0000000000000000 ]---
>
> And here's the BUG:
>
> [    5.121989] BUG: kernel NULL pointer dereference, address: 00000000000000b8
> [    5.122281] #PF: supervisor read access in kernel mode
> [    5.122500] #PF: error_code(0x0000) - not-present page
> [    5.122717] PGD 0 P4D 0
> [    5.122828] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [    5.123036] CPU: 0 PID: 2629 Comm: dbench Tainted: G        W          6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350
> [    5.123470] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [    5.123857] RIP: 0010:ext4_fc_perform_commit+0x303/0x4b0
> [    5.124082] Code: fd 48 2d a0 00 00 00 48 39 d3 75 af e9 ac fe ff ff 49 8b 4d 58 49 8d 45 58 48 39 c1 0f 84 9f 01 00 00 49 8b 45 58 49 63 4d 08 <48> 39 88 b8 00 00 00 4c 8d 78 78 0f 85 7f 01 00 00 48 89 ef e8 c4
> [    5.124855] RSP: 0018:ffffc90003727df8 EFLAGS: 00010246
> [    5.125072] RAX: 0000000000000000 RBX: ffff8880089e5f08 RCX: 00000000089e5f08
> [    5.125416] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88800a22c7f0
> [    5.125712] RBP: ffff88800a22c7f0 R08: ffff88807dc2fbc0 R09: 0000000000000000
> [    5.126009] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800a22c7c8
> [    5.126310] R13: ffff8880089e5f08 R14: ffff88800a22c7a8 R15: ffff88800a22c708
> [    5.126609] FS:  00007fec79a4e740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [    5.126943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.127188] CR2: 00000000000000b8 CR3: 000000000f48e002 CR4: 0000000000770ef0
> [    5.127483] PKRU: 55555554
> [    5.127598] Call Trace:
> [    5.127705]  <TASK>
> [    5.127838]  ? __die+0x23/0x60
> [    5.127973]  ? page_fault_oops+0xa3/0x160
> [    5.128143]  ? exc_page_fault+0x6a/0x160
> [    5.128351]  ? asm_exc_page_fault+0x26/0x30
> [    5.128530]  ? ext4_fc_perform_commit+0x303/0x4b0
> [    5.128728]  ? ext4_fc_perform_commit+0x36b/0x4b0
> [    5.128948]  ext4_fc_commit+0x17f/0x300
> [    5.129116]  ext4_sync_file+0x1ce/0x380
> [    5.129310]  __x64_sys_fsync+0x3b/0x70
> [    5.129470]  do_syscall_64+0x4b/0x110
> [    5.129627]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [    5.129840] RIP: 0033:0x7fec79b4fb10
> [    5.129995] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d d1 ba 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
> [    5.130764] RSP: 002b:00007ffed55de948 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
> [    5.131079] RAX: ffffffffffffffda RBX: 00007fec79c35450 RCX: 00007fec79b4fb10
> [    5.131381] RDX: 0000000000002b6f RSI: 0000000000002b6f RDI: 0000000000000005
> [    5.131681] RBP: 000055d2f34a3660 R08: 1999999999999999 R09: 0000000000000000
> [    5.131981] R10: 00007fec79bcdac0 R11: 0000000000000202 R12: 000055d2f34a35d0
> [    5.132280] R13: 00007fec79c35450 R14: 0000000000000003 R15: 00007fec79c354a0
> [    5.132575]  </TASK>
> [    5.132670] CR2: 00000000000000b8
> [    5.132812] ---[ end trace 0000000000000000 ]---
>
> Harshad, can you take a look?   Thanks!
>
>                                                 - Ted
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 04/10] ext4: rework fast commit commit path
  2024-06-28 13:43   ` Jan Kara
@ 2024-07-13  1:38     ` harshad shirwadkar
  2024-07-17 12:11       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: harshad shirwadkar @ 2024-07-13  1:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, tytso, saukad, harshads

On Fri, Jun 28, 2024 at 6:43 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 29-05-24 01:19:57, Harshad Shirwadkar wrote:
> > This patch reworks fast commit's commit path to remove locking the
> > journal for the entire duration of a fast commit. Instead, we only lock
> > the journal while marking all the eligible inodes as "committing". This
> > allows handles to make progress in parallel with the fast commit.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ...
> > @@ -1124,6 +1119,20 @@ static int ext4_fc_perform_commit(journal_t *journal)
> >       int ret = 0;
> >       u32 crc = 0;
> >
> > +     /*
> > +      * Wait for all the handles of the current transaction to complete
> > +      * and then lock the journal. Since this is essentially the commit
> > +      * path, we don't need to wait for reserved handles.
> > +      */
>
> Here I'd expand the comment to explain better why this is safe. Like:
>
>         /*
>          * Wait for all the handles of the current transaction to complete
>          * and then lock the journal. We don't need to wait for reserved
>          * handles since we only need to set EXT4_STATE_FC_COMMITTING state
>          * while the journal is locked - in particular we don't depend on
>          * page writeback state so there's no risk of deadlocking reserved
>          * handles.
>          */
>
> > +     jbd2_journal_lock_updates_no_rsv(journal);
> > +     spin_lock(&sbi->s_fc_lock);
> > +     list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
> > +             ext4_set_inode_state(&iter->vfs_inode,
> > +                                  EXT4_STATE_FC_COMMITTING);
> > +     }
> > +     spin_unlock(&sbi->s_fc_lock);
> > +     jbd2_journal_unlock_updates(journal);
> > +
> >       ret = ext4_fc_submit_inode_data_all(journal);
> >       if (ret)
> >               return ret;
> > @@ -1174,6 +1183,18 @@ static int ext4_fc_perform_commit(journal_t *journal)
> >               ret = ext4_fc_write_inode(inode, &crc);
> >               if (ret)
> >                       goto out;
> > +             ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
> > +             /*
> > +              * Make sure clearing of EXT4_STATE_FC_COMMITTING is
> > +              * visible before we send the wakeup. Pairs with implicit
> > +              * barrier in prepare_to_wait() in ext4_fc_track_inode().
> > +              */
> > +             smp_mb();
> > +#if (BITS_PER_LONG < 64)
> > +             wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
> > +#else
> > +             wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
> > +#endif
>
> Maybe create a helper function for clearing the EXT4_STATE_FC_COMMITTING
> bit and waking up the wait queue? It's a bit subtle and used in a few
> places.
>
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index cb0b8d6fc0c6..4361e5c56490 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -865,25 +865,15 @@ void jbd2_journal_wait_updates(journal_t *journal)
> >       }
> >  }
> >
> > -/**
> > - * jbd2_journal_lock_updates () - establish a transaction barrier.
> > - * @journal:  Journal to establish a barrier on.
> > - *
> > - * This locks out any further updates from being started, and blocks
> > - * until all existing updates have completed, returning only once the
> > - * journal is in a quiescent state with no updates running.
> > - *
> > - * The journal lock should not be held on entry.
> > - */
> > -void jbd2_journal_lock_updates(journal_t *journal)
> > +static void __jbd2_journal_lock_updates(journal_t *journal, bool wait_on_rsv)
> >  {
> >       jbd2_might_wait_for_commit(journal);
> >
> >       write_lock(&journal->j_state_lock);
> >       ++journal->j_barrier_count;
> >
> > -     /* Wait until there are no reserved handles */
> > -     if (atomic_read(&journal->j_reserved_credits)) {
> > +     if (wait_on_rsv && atomic_read(&journal->j_reserved_credits)) {
> > +             /* Wait until there are no reserved handles */
>
> So it is not as simple as this. start_this_handle() ignores
> journal->j_barrier_count for reserved handles so they would happily start
> while you have the journal locked with jbd2_journal_lock_updates_no_rsv()
> and then writeback code could mess with your fastcommit state. Or perhaps I
> miss some subtlety why this is fine - but that then deserves a good
> explanation in a comment or maybe a different API because currently
> jbd2_journal_lock_updates_no_rsv() doesn't do what one would naively
> expect.

AFAICT, jbd2_journal_commit_transaction() only calls
jbd2_journal_wait_updates(journal) which waits for
trasaction->t_updates to reach 0. But it doesn't wait for
journal->j_reserved_credits to reach 0. I saw a performance
improvement by removing waiting on reserved handles in FC commit code
as well. Given that JBD2 doesn't wait, I (perhaps incorrectly) thought
that it'd be okay to not wait in FC as well. Could you help me
understand why the JBD2 journal commit doesn't need to wait for
reserved handles?

- Harshad
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks()
  2024-06-28 14:18   ` Jan Kara
@ 2024-07-13  2:01     ` harshad shirwadkar
  2024-07-17 13:07       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: harshad shirwadkar @ 2024-07-13  2:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, tytso, saukad, harshads

On Fri, Jun 28, 2024 at 7:18 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 29-05-24 01:20:00, Harshad Shirwadkar wrote:
> > Add nolock flag to ext4_map_blocks() which skips grabbing
> > i_data_sem in ext4_map_blocks. In FC commit path, we first
> > mark the inode as committing and thereby prevent any mutations
> > on it. Thus, it should be safe to call ext4_map_blocks()
> > without i_data_sem in this case. This is a workaround to
> > the problem mentioned in RFC V4 version cover letter[1] of this
> > patch series which pointed out that there is in incosistency between
> > ext4_map_blocks() behavior when EXT4_GET_BLOCKS_CACHED_NOWAIT is
> > passed. This patch gets rid of the need to call ext4_map_blocks()
> > with EXT4_GET_BLOCKS_CACHED_NOWAIT and instead call it with
> > EXT4_GET_BLOCKS_NOLOCK. I verified that generic/311 which failed
> > in cached_nowait mode passes with nolock mode.
> >
> > [1] https://lwn.net/Articles/902022/
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
>
> I'm sorry I forgot since last time - can you remind me why we cannot we
> grab i_data_sem from ext4_fc_write_inode_data()? Because as you write
> above, nobody should really be holding that lock while inode is
> EXT4_STATE_FC_COMMITTING anyway...
>
The original reason was that the commit path calls ext4_map_blocks()
which needs i_data_sem. But other places might grab i_data_sem and
then call ext4_mark_inode_dirty(). Ext4_mark_inode_dirty() can block
for a fast commit to finish, causing a deadlock.

In this patchset I'm attacking this problem 2 ways:
(1) Ensure i_data_sem is always grabbed before ext4_mark_inode_dirty()
(2) (This patch) Remove the need of grabbing i_data_sem in
ext4_map_blocks() when in the commit path.

I am now realizing either (1) or (2) is sufficient -- both are not
needed. (2) is more maintainable. (1) seems fragile and future code
paths can potentially break that rule which can cause hard to debug
failures. So, how about just keeping this patch and dropping the need
to remove grab i_data_sem before ext4_mark_inode_dirty()? If no
concerns, I'll handle this in V7.

- Harshad

>                                                                 Honza
>
> > ---
> >  fs/ext4/ext4.h        |  1 +
> >  fs/ext4/fast_commit.c | 16 ++++++++--------
> >  fs/ext4/inode.c       | 14 ++++++++++++--
> >  3 files changed, 21 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index d802040e94df..196c513f82dd 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -720,6 +720,7 @@ enum {
> >  #define EXT4_GET_BLOCKS_IO_SUBMIT            0x0400
> >       /* Caller is in the atomic contex, find extent if it has been cached */
> >  #define EXT4_GET_BLOCKS_CACHED_NOWAIT                0x0800
> > +#define EXT4_GET_BLOCKS_NOLOCK                       0x1000
> >
> >  /*
> >   * The bit position of these flags must not overlap with any of the
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > index b81b0292aa59..0b7064f8dfa5 100644
> > --- a/fs/ext4/fast_commit.c
> > +++ b/fs/ext4/fast_commit.c
> > @@ -559,13 +559,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
> >               !list_empty(&ei->i_fc_list))
> >               return;
> >
> > -     /*
> > -      * If we come here, we may sleep while waiting for the inode to
> > -      * commit. We shouldn't be holding i_data_sem in write mode when we go
> > -      * to sleep since the commit path needs to grab the lock while
> > -      * committing the inode.
> > -      */
> > -     WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1));
> >
> >       while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> >  #if (BITS_PER_LONG < 64)
> > @@ -898,7 +891,14 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> >       while (cur_lblk_off <= new_blk_size) {
> >               map.m_lblk = cur_lblk_off;
> >               map.m_len = new_blk_size - cur_lblk_off + 1;
> > -             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > +             /*
> > +              * Given that this inode is being committed,
> > +              * EXT4_STATE_FC_COMMITTING is already set on this inode.
> > +              * Which means all the mutations on the inode are paused
> > +              * until the commit operation is complete. Thus it is safe
> > +              * call ext4_map_blocks() in no lock mode.
> > +              */
> > +             ret = ext4_map_blocks(NULL, inode, &map, EXT4_GET_BLOCKS_NOLOCK);
> >               if (ret < 0)
> >                       return -ECANCELED;
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 61ffbdc2fb16..f00408017c7a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -546,7 +546,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >        * Try to see if we can get the block without requesting a new
> >        * file system block.
> >        */
> > -     down_read(&EXT4_I(inode)->i_data_sem);
> > +     if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
> > +             down_read(&EXT4_I(inode)->i_data_sem);
> >       if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> >               retval = ext4_ext_map_blocks(handle, inode, map, 0);
> >       } else {
> > @@ -573,7 +574,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >               ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> >                                     map->m_pblk, status);
> >       }
> > -     up_read((&EXT4_I(inode)->i_data_sem));
> > +     /*
> > +      * We should never call ext4_map_blocks() in nolock mode outside
> > +      * of fast commit path.
> > +      */
> > +     WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) &&
> > +             !ext4_test_inode_state(inode,
> > +                                    EXT4_STATE_FC_COMMITTING));
> > +     if (!(flags & EXT4_GET_BLOCKS_NOLOCK))
> > +             up_read((&EXT4_I(inode)->i_data_sem));
> >
> >  found:
> >       if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > @@ -614,6 +623,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >        * the write lock of i_data_sem, and call get_block()
> >        * with create == 1 flag.
> >        */
> > +     WARN_ON((flags & EXT4_GET_BLOCKS_NOLOCK) != 0);
> >       down_write(&EXT4_I(inode)->i_data_sem);
> >
> >       /*
> > --
> > 2.45.1.288.g0e0cd299f1-goog
> >
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 04/10] ext4: rework fast commit commit path
  2024-07-13  1:38     ` harshad shirwadkar
@ 2024-07-17 12:11       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-07-17 12:11 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Jan Kara, linux-ext4, tytso, saukad, harshads

On Fri 12-07-24 18:38:21, harshad shirwadkar wrote:
> On Fri, Jun 28, 2024 at 6:43 AM Jan Kara <jack@suse.cz> wrote:
> > > +static void __jbd2_journal_lock_updates(journal_t *journal, bool wait_on_rsv)
> > >  {
> > >       jbd2_might_wait_for_commit(journal);
> > >
> > >       write_lock(&journal->j_state_lock);
> > >       ++journal->j_barrier_count;
> > >
> > > -     /* Wait until there are no reserved handles */
> > > -     if (atomic_read(&journal->j_reserved_credits)) {
> > > +     if (wait_on_rsv && atomic_read(&journal->j_reserved_credits)) {
> > > +             /* Wait until there are no reserved handles */
> >
> > So it is not as simple as this. start_this_handle() ignores
> > journal->j_barrier_count for reserved handles so they would happily start
> > while you have the journal locked with jbd2_journal_lock_updates_no_rsv()
> > and then writeback code could mess with your fastcommit state. Or perhaps I
> > miss some subtlety why this is fine - but that then deserves a good
> > explanation in a comment or maybe a different API because currently
> > jbd2_journal_lock_updates_no_rsv() doesn't do what one would naively
> > expect.
> 
> AFAICT, jbd2_journal_commit_transaction() only calls
> jbd2_journal_wait_updates(journal) which waits for
> trasaction->t_updates to reach 0. But it doesn't wait for
> journal->j_reserved_credits to reach 0. I saw a performance
> improvement by removing waiting on reserved handles in FC commit code
> as well. Given that JBD2 doesn't wait, I (perhaps incorrectly) thought
> that it'd be okay to not wait in FC as well. Could you help me
> understand why the JBD2 journal commit doesn't need to wait for
> reserved handles?

Sure. When we do page writeback, we may need to do some metadata
modifications (such as clearing unwritten bits in the extent tree) before
the writeback can complete and we can clear PG_Writeback bits. Hence if we
started a normal transaction after IO completes to do metadata
modifications, we could easily deadlock with transaction commit - broadly
speaking transaction commit waits for PG_Writeback to clear, page writeback
code waits for transaction commit so that it can free space in the journal
and start a new transaction. This is why reserved transactions were
introduced. Their main point is: If you have handle reserved, you are
guaranteed you can start this handle without blocking waiting for space in
the journal. Note that we could also start a normal handle before doing
page writeback and use it after IO completion for metadata changes and this
would work in principle (besides the technical troubles with propagating
the handle to IO completion context) but because we can writeback quite
large chunks of data, these handles could be running for tens of seconds
which would make other filesystem operations starve. Thus we allow reserved
(but not yet started) handles to be moved from currently running
transaction into a next one and this is fine because the reservation code
makes sure there's always enough free space in the journal for reserved
handles. So commit code can happily do transaction commit while reserved
handles are existing because if their owner decides to start them, they'll
just join the currently running transaction (or create one if there's
none).

But jbd2_journal_lock_updates() always needs to wait for all outstanding
handles including the reserved ones because it needs to guarantee there are
no modifications pending for the journal. Even you in fastcommit code rely
on inode list not being modified after jbd2_journal_lock_updates() and
reserved handles would break that assumption because existing reserved
handles can start after your version of jbd2_journal_lock_updates() not
waiting for reserved handles would return. And reserved handles need to be
able to start while jbd2_journal_lock_updates() is waiting for the journal
to quiesce because we need page writeback to finish before we can quiesce
the journal.

In theory you could create new journal locking mechanism for fastcommit
code that would *also* block starting of reserved handles since what
fastcommit needs to do with a locked journal does not depend on page
writeback. But frankly I'm not convinced this complication is worth it.

I hope this makes things clearer.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks()
  2024-07-13  2:01     ` harshad shirwadkar
@ 2024-07-17 13:07       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2024-07-17 13:07 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Jan Kara, linux-ext4, tytso, saukad, harshads

On Fri 12-07-24 19:01:25, harshad shirwadkar wrote:
> On Fri, Jun 28, 2024 at 7:18 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 29-05-24 01:20:00, Harshad Shirwadkar wrote:
> > > Add nolock flag to ext4_map_blocks() which skips grabbing
> > > i_data_sem in ext4_map_blocks. In FC commit path, we first
> > > mark the inode as committing and thereby prevent any mutations
> > > on it. Thus, it should be safe to call ext4_map_blocks()
> > > without i_data_sem in this case. This is a workaround to
> > > the problem mentioned in RFC V4 version cover letter[1] of this
> > > patch series which pointed out that there is in incosistency between
> > > ext4_map_blocks() behavior when EXT4_GET_BLOCKS_CACHED_NOWAIT is
> > > passed. This patch gets rid of the need to call ext4_map_blocks()
> > > with EXT4_GET_BLOCKS_CACHED_NOWAIT and instead call it with
> > > EXT4_GET_BLOCKS_NOLOCK. I verified that generic/311 which failed
> > > in cached_nowait mode passes with nolock mode.
> > >
> > > [1] https://lwn.net/Articles/902022/
> > >
> > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> >
> > I'm sorry I forgot since last time - can you remind me why we cannot we
> > grab i_data_sem from ext4_fc_write_inode_data()? Because as you write
> > above, nobody should really be holding that lock while inode is
> > EXT4_STATE_FC_COMMITTING anyway...
> >
> The original reason was that the commit path calls ext4_map_blocks()
> which needs i_data_sem. But other places might grab i_data_sem and
> then call ext4_mark_inode_dirty(). Ext4_mark_inode_dirty() can block
> for a fast commit to finish, causing a deadlock.
> 
> In this patchset I'm attacking this problem 2 ways:
> (1) Ensure i_data_sem is always grabbed before ext4_mark_inode_dirty()

I think this rather should be: Make sure the inode is properly tracked with
fastcommit code (which waits for EXT4_STATE_FC_COMMITTING) before grabbing
i_data_sem, shouldn't it?

> (2) (This patch) Remove the need of grabbing i_data_sem in
> ext4_map_blocks() when in the commit path.
> 
> I am now realizing either (1) or (2) is sufficient -- both are not
> needed.

Yes, this is what was confusing me somewhat.

> (2) is more maintainable. (1) seems fragile and future code
> paths can potentially break that rule which can cause hard to debug
> failures. So, how about just keeping this patch and dropping the need
> to remove grab i_data_sem before ext4_mark_inode_dirty()? If no
> concerns, I'll handle this in V7.

Well, you have added assertions into ext4_mark_inode_dirty() exactly to
catch possible problems with inode not being tracked with fastcommit code.
I agree 1) needs changes in more places but long term, it actually seems
*less* fragile with the assertions added. Because adding conditional
locking to our core block mapping function and relying on the fact that
nobody can modify the mapping structures while EXT4_STATE_FC_COMMITTING is
set is quite hard to assert for and the failures are going to be hard to
debug as they will result in random memory corruptions, oopses etc. So I
believe you should rather remove 2).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-07-17 13:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-29  1:19 [PATCH v6 00/10] Ext4 fast commit performance patch series Harshad Shirwadkar
2024-05-29  1:19 ` [PATCH v6 01/10] ext4: convert i_fc_lock to spinlock Harshad Shirwadkar
2024-06-21 16:19   ` Jan Kara
2024-05-29  1:19 ` [PATCH v6 02/10] ext4: for committing inode, make ext4_fc_track_inode wait Harshad Shirwadkar
2024-06-21 16:33   ` Jan Kara
2024-06-28 14:45   ` Jan Kara
2024-07-01 22:08   ` Theodore Ts'o
2024-07-12 17:09     ` harshad shirwadkar
2024-05-29  1:19 ` [PATCH v6 03/10] ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr Harshad Shirwadkar
2024-06-28 13:15   ` Jan Kara
2024-05-29  1:19 ` [PATCH v6 04/10] ext4: rework fast commit commit path Harshad Shirwadkar
2024-06-28 13:43   ` Jan Kara
2024-07-13  1:38     ` harshad shirwadkar
2024-07-17 12:11       ` Jan Kara
2024-05-29  1:19 ` [PATCH v6 05/10] ext4: drop i_fc_updates from inode fc info Harshad Shirwadkar
2024-05-29  1:19 ` [PATCH v6 06/10] ext4: update code documentation Harshad Shirwadkar
2024-05-29  1:20 ` [PATCH v6 07/10] ext4: add nolock mode to ext4_map_blocks() Harshad Shirwadkar
2024-06-28 14:18   ` Jan Kara
2024-07-13  2:01     ` harshad shirwadkar
2024-07-17 13:07       ` Jan Kara
2024-05-29  1:20 ` [PATCH v6 08/10] ext4: introduce selective flushing in fast commit Harshad Shirwadkar
2024-06-28 14:33   ` Jan Kara
2024-05-29  1:20 ` [PATCH v6 09/10] ext4: temporarily elevate commit thread priority Harshad Shirwadkar
2024-06-28 14:42   ` Jan Kara
2024-05-29  1:20 ` [PATCH v6 10/10] ext4: make fast commit ineligible on ext4_reserve_inode_write failure Harshad Shirwadkar
2024-06-28 14:47   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).