[PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
@ 2025-03-06 14:28 Ojaswin Mujoo
  2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-06 14:28 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Jan Kara, Baokun Li, linux-kernel

** Changes since v1 [1] **

 * Picked up RVBs from Jan and Ritesh
 * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
   to determine when to journal sb vs when to commit directly.
 * Added a prep patch 1/3

[1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4

@Baokun, I didn't get a chance to look into the journal_inode
modifications we were discussing in [2]. I'll try to spend some time and
send that as a separate patch. Hope that's okay

[2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05

** Original Cover **

When running LTP stress tests on ext4, after a multiday run we seemed to
have hit the following BUG_ON:

 [NIP  : start_this_handle+268]
 #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
 #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
 #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
 #6 [c000001067c27c70] process_one_work at c000000000188790
 #7 [c000001067c27d20] worker_thread at c00000000018973c
 #8 [c000001067c27dc0] kthread at c000000000196c84
 #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64

Which comes out to

  382   repeat:
  383           read_lock(&journal->j_state_lock);
* 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
  385           if (is_journal_aborted(journal) ||
  386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
  387                   read_unlock(&journal->j_state_lock);

Initially this seemed like it should never happen but upon crash
analysis it seems like it could indeed be hit as described in patch 1/2.

I would like to add that through the logs we only knew that:

- ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
- update_super_work had hit the BUG_ON

I was not able to hit this bug again (without modifying the kernel to
inject errors) but the above backtrace seems to be one possible paths
where this BUG_ON can be hit. Rest of the analysis and fix is in patch
2/3. Patch 3 is just a small tweak that i found helpful while debugging.

That being said, journalling is something I'm not very familiar with and
there might be gaps in my understanding so thoughts and suggestions are
welcome.

Ojaswin Mujoo (3):
  ext4: define ext4_journal_destroy wrapper
  ext4: avoid journaling sb update on error if journal is destroying
  ext4: Make sb update interval tunable

 fs/ext4/ext4.h      | 11 +++++++++++
 fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
 fs/ext4/super.c     | 35 +++++++++++++++++------------------
 fs/ext4/sysfs.c     |  4 ++++
 4 files changed, 54 insertions(+), 18 deletions(-)

-- 
2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper
  2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
@ 2025-03-06 14:28 ` Ojaswin Mujoo
  2025-03-07 14:05   ` Jan Kara
  2025-03-08  1:18   ` Baokun Li
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-06 14:28 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Jan Kara, Baokun Li, linux-kernel

Define an ext4 wrapper over jbd2_journal_destroy to make sure we
have consistent behavior during journal destruction. This will also
come useful in the next patch where we add some ext4 specific logic
in the destroy path.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4_jbd2.h | 14 ++++++++++++++
 fs/ext4/super.c     | 16 ++++++----------
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 3f2596c9e5f2..9b3c9df02a39 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -429,4 +429,18 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+/*
+ * Pass journal explicitly as it may not be cached in the sbi->s_journal in some
+ * cases
+ */
+static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
+{
+	int err = 0;
+
+	err = jbd2_journal_destroy(journal);
+	sbi->s_journal = NULL;
+
+	return err;
+}
+
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a963ffda692a..8ad664d47806 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1297,8 +1297,7 @@ static void ext4_put_super(struct super_block *sb)
 
 	if (sbi->s_journal) {
 		aborted = is_journal_aborted(sbi->s_journal);
-		err = jbd2_journal_destroy(sbi->s_journal);
-		sbi->s_journal = NULL;
+		err = ext4_journal_destroy(sbi, sbi->s_journal);
 		if ((err < 0) && !aborted) {
 			ext4_abort(sb, -err, "Couldn't clean up the journal");
 		}
@@ -4960,8 +4959,7 @@ static int ext4_load_and_init_journal(struct super_block *sb,
 out:
 	/* flush s_sb_upd_work before destroying the journal. */
 	flush_work(&sbi->s_sb_upd_work);
-	jbd2_journal_destroy(sbi->s_journal);
-	sbi->s_journal = NULL;
+	ext4_journal_destroy(sbi, sbi->s_journal);
 	return -EINVAL;
 }
 
@@ -5652,8 +5650,7 @@ failed_mount8: __maybe_unused
 	if (sbi->s_journal) {
 		/* flush s_sb_upd_work before journal destroy. */
 		flush_work(&sbi->s_sb_upd_work);
-		jbd2_journal_destroy(sbi->s_journal);
-		sbi->s_journal = NULL;
+		ext4_journal_destroy(sbi, sbi->s_journal);
 	}
 failed_mount3a:
 	ext4_es_unregister_shrinker(sbi);
@@ -5958,7 +5955,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
 	return journal;
 
 out_journal:
-	jbd2_journal_destroy(journal);
+	ext4_journal_destroy(EXT4_SB(sb), journal);
 out_bdev:
 	bdev_fput(bdev_file);
 	return ERR_PTR(errno);
@@ -6075,8 +6072,7 @@ static int ext4_load_journal(struct super_block *sb,
 	EXT4_SB(sb)->s_journal = journal;
 	err = ext4_clear_journal_err(sb, es);
 	if (err) {
-		EXT4_SB(sb)->s_journal = NULL;
-		jbd2_journal_destroy(journal);
+		ext4_journal_destroy(EXT4_SB(sb), journal);
 		return err;
 	}
 
@@ -6094,7 +6090,7 @@ static int ext4_load_journal(struct super_block *sb,
 	return 0;
 
 err_out:
-	jbd2_journal_destroy(journal);
+	ext4_journal_destroy(EXT4_SB(sb), journal);
 	return err;
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
  2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
@ 2025-03-06 14:28 ` Ojaswin Mujoo
  2025-03-07  2:49   ` Zhang Yi
                     ` (3 more replies)
  2025-03-06 14:28 ` [PATCH v2 3/3] ext4: Make sb update interval tunable Ojaswin Mujoo
  2025-03-08  1:09 ` [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Baokun Li
  3 siblings, 4 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-06 14:28 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar

Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:

(umount)
ext4_kill_sb
  kill_block_super
    generic_shutdown_super
      sync_filesystem /* commits all txns */
      evict_inodes
        /* might start a new txn */
      ext4_put_super
	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
        jbd2_journal_destroy
          journal_kill_thread
            journal->j_flags |= JBD2_UNMOUNT;
          jbd2_journal_commit_transaction
            jbd2_journal_get_descriptor_buffer
              jbd2_journal_bmap
                ext4_journal_bmap
                  ext4_map_blocks
                    ...
                    ext4_inode_error
                      ext4_handle_error
                        schedule_work(&sbi->s_sb_upd_work)

                                               /* work queue kicks in */
                                               update_super_work
                                                 jbd2_journal_start
                                                   start_this_handle
                                                     BUG_ON(journal->j_flags &
                                                            JBD2_UNMOUNT)

Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
destroying only do a journaled (and deferred) update of sb if this flag is not
set. Otherwise, just fallback to an un-journaled commit.

We set sbi->s_journal_destroying = true only after all the FS updates are done
during ext4_put_super() (except a running transaction that will get commited
during jbd2_journal_destroy()). After this point, it is safe to commit the sb
outside the journal as it won't race with a journaled update (refer
2d01ddc86606).

Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.

Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h      | 2 ++
 fs/ext4/ext4_jbd2.h | 8 ++++++++
 fs/ext4/super.c     | 4 +++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2b7d781bfcad..d48e93bd5690 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1728,6 +1728,8 @@ struct ext4_sb_info {
 	 */
 	struct work_struct s_sb_upd_work;
 
+	bool s_journal_destorying;
+
 	/* Atomic write unit values in bytes */
 	unsigned int s_awu_min;
 	unsigned int s_awu_max;
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 9b3c9df02a39..6bd3ca84410d 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
 {
 	int err = 0;
 
+	/*
+	 * At this point all pending FS updates should be done except a possible
+	 * running transaction (which will commit in jbd2_journal_destroy). It
+	 * is now safe for any new errors to directly commit superblock rather
+	 * than going via journal.
+	 */
+	sbi->s_journal_destorying = true;
+
 	err = jbd2_journal_destroy(journal);
 	sbi->s_journal = NULL;
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8ad664d47806..31552cf0519a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 		 * constraints, it may not be safe to do it right here so we
 		 * defer superblock flushing to a workqueue.
 		 */
-		if (continue_fs && journal)
+		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
 			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 		else
 			ext4_commit_super(sb);
@@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	spin_lock_init(&sbi->s_error_lock);
 	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
 
+	sbi->s_journal_destorying = false;
+
 	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
 	if (err)
 		goto failed_mount3;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 3/3] ext4: Make sb update interval tunable
  2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
  2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
@ 2025-03-06 14:28 ` Ojaswin Mujoo
  2025-03-08  2:25   ` Baokun Li
  2025-03-08  1:09 ` [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Baokun Li
  3 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-06 14:28 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Jan Kara, Baokun Li, linux-kernel, Ritesh Harjani (IBM)

Currently, outside error paths, we auto commit the super block after 1
hour has passed and 16MB worth of updates have been written since last
commit. This is a policy decision so make this tunable while keeping the
defaults same. This is useful if user wants to tweak the superblock
behavior or for debugging the codepath by allowing to trigger it more
frequently.

We can now tweak the super block update using sb_update_sec and
sb_update_kb files in /sys/fs/ext4/<dev>/

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h  |  9 +++++++++
 fs/ext4/super.c | 15 ++++++++-------
 fs/ext4/sysfs.c |  4 ++++
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d48e93bd5690..82c902ed86f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1608,6 +1608,8 @@ struct ext4_sb_info {
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
+	unsigned int s_sb_update_sec;
+	unsigned int s_sb_update_kb;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -2281,6 +2283,13 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
 #define EXT4_DEF_MIN_BATCH_TIME	0
 #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
 
+/*
+ * Default values for superblock update
+ */
+#define EXT4_DEF_SB_UPDATE_INTERVAL_SEC (3600) /* seconds (1 hour) */
+#define EXT4_DEF_SB_UPDATE_INTERVAL_KB (16384) /* kilobytes (16MB) */
+
+
 /*
  * Minimum number of groups in a flexgroup before we separate out
  * directories into the first block group of a flexgroup
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 31552cf0519a..1b47b111c583 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -447,9 +447,6 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
 #define ext4_get_tstamp(es, tstamp) \
 	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
 
-#define EXT4_SB_REFRESH_INTERVAL_SEC (3600) /* seconds (1 hour) */
-#define EXT4_SB_REFRESH_INTERVAL_KB (16384) /* kilobytes (16MB) */
-
 /*
  * The ext4_maybe_update_superblock() function checks and updates the
  * superblock if needed.
@@ -457,8 +454,10 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
  * This function is designed to update the on-disk superblock only under
  * certain conditions to prevent excessive disk writes and unnecessary
  * waking of the disk from sleep. The superblock will be updated if:
- * 1. More than an hour has passed since the last superblock update, and
- * 2. More than 16MB have been written since the last superblock update.
+ * 1. More than sbi->s_sb_update_sec (def: 1 hour) has passed since the last
+ *    superblock update
+ * 2. More than sbi->s_sb_update_kb (def: 16MB) kbs have been written since the
+ *    last superblock update.
  *
  * @sb: The superblock
  */
@@ -479,7 +478,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
 	now = ktime_get_real_seconds();
 	last_update = ext4_get_tstamp(es, s_wtime);
 
-	if (likely(now - last_update < EXT4_SB_REFRESH_INTERVAL_SEC))
+	if (likely(now - last_update < sbi->s_sb_update_sec))
 		return;
 
 	lifetime_write_kbytes = sbi->s_kbytes_written +
@@ -494,7 +493,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
 	 */
 	diff_size = lifetime_write_kbytes - le64_to_cpu(es->s_kbytes_written);
 
-	if (diff_size > EXT4_SB_REFRESH_INTERVAL_KB)
+	if (diff_size > sbi->s_sb_update_kb)
 		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 }
 
@@ -5246,6 +5245,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
+	sbi->s_sb_update_kb = EXT4_DEF_SB_UPDATE_INTERVAL_KB;
+	sbi->s_sb_update_sec = EXT4_DEF_SB_UPDATE_INTERVAL_SEC;
 
 	/*
 	 * set default s_li_wait_mult for lazyinit, for the case there is
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index ddb54608ca2e..987bd00f916a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -254,6 +254,8 @@ EXT4_ATTR(journal_task, 0444, journal_task);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
+EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
+EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -305,6 +307,8 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(mb_prefetch),
 	ATTR_LIST(mb_prefetch_limit),
 	ATTR_LIST(last_trim_minblks),
+	ATTR_LIST(sb_update_sec),
+	ATTR_LIST(sb_update_kb),
 	NULL,
 };
 ATTRIBUTE_GROUPS(ext4);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
@ 2025-03-07  2:49   ` Zhang Yi
  2025-03-07  6:34     ` Ojaswin Mujoo
  2025-03-07 14:26   ` Jan Kara
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Zhang Yi @ 2025-03-07  2:49 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> Presently we always BUG_ON if trying to start a transaction on a journal marked
> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> stress tests, it was observed that in case of some error handling paths, it is
> possible for update_super_work to start a transaction after the journal is
> destroyed eg:
> 
> (umount)
> ext4_kill_sb
>   kill_block_super
>     generic_shutdown_super
>       sync_filesystem /* commits all txns */
>       evict_inodes
>         /* might start a new txn */
>       ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>         jbd2_journal_destroy
>           journal_kill_thread
>             journal->j_flags |= JBD2_UNMOUNT;
>           jbd2_journal_commit_transaction
>             jbd2_journal_get_descriptor_buffer
>               jbd2_journal_bmap
>                 ext4_journal_bmap
>                   ext4_map_blocks
>                     ...
>                     ext4_inode_error
>                       ext4_handle_error
>                         schedule_work(&sbi->s_sb_upd_work)
> 
>                                                /* work queue kicks in */
>                                                update_super_work
>                                                  jbd2_journal_start
>                                                    start_this_handle
>                                                      BUG_ON(journal->j_flags &
>                                                             JBD2_UNMOUNT)
> 
> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> destroying only do a journaled (and deferred) update of sb if this flag is not
> set. Otherwise, just fallback to an un-journaled commit.
> 
> We set sbi->s_journal_destroying = true only after all the FS updates are done
> during ext4_put_super() (except a running transaction that will get commited
> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> outside the journal as it won't race with a journaled update (refer
> 2d01ddc86606).
> 
> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> called from mballoc and AFAICT it would be always valid to schedule work here.
> 
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/ext4/ext4.h      | 2 ++
>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>  fs/ext4/super.c     | 4 +++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b7d781bfcad..d48e93bd5690 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>  	 */
>  	struct work_struct s_sb_upd_work;
>  
> +	bool s_journal_destorying;
> +
>  	/* Atomic write unit values in bytes */
>  	unsigned int s_awu_min;
>  	unsigned int s_awu_max;
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 9b3c9df02a39..6bd3ca84410d 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>  {
>  	int err = 0;
>  
> +	/*
> +	 * At this point all pending FS updates should be done except a possible
> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> +	 * is now safe for any new errors to directly commit superblock rather
> +	 * than going via journal.
> +	 */
> +	sbi->s_journal_destorying = true;
> +

Hi, Ojaswin!

I'm afraid you still need to flush the superblock update work here,
otherwise I guess the race condition you mentioned in v1 could still
occur.

 ext4_put_super()
  flush_work(&sbi->s_sb_upd_work)

                    **kjournald2**
                    jbd2_journal_commit_transaction()
                    ...
                    ext4_inode_error()
                      /* JBD2_UNMOUNT not set */
                      schedule_work(s_sb_upd_work)

                                  **workqueue**
                                   update_super_work
                                   /* s_journal_destorying is not set */
                            	   if (journal && !s_journal_destorying)

  ext4_journal_destroy()
   /* set s_journal_destorying */
   sbi->s_journal_destorying = true;
   jbd2_journal_destroy()
    journal->j_flags |= JBD2_UNMOUNT;

                                       jbd2_journal_start()
                                        start_this_handle()
                                          BUG_ON(JBD2_UNMOUNT)

Thanks,
Yi.

>  	err = jbd2_journal_destroy(journal);
>  	sbi->s_journal = NULL;
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8ad664d47806..31552cf0519a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>  		 * constraints, it may not be safe to do it right here so we
>  		 * defer superblock flushing to a workqueue.
>  		 */
> -		if (continue_fs && journal)
> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>  		else
>  			ext4_commit_super(sb);
> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	spin_lock_init(&sbi->s_error_lock);
>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>  
> +	sbi->s_journal_destorying = false;
> +
>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>  	if (err)
>  		goto failed_mount3;


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07  2:49   ` Zhang Yi
@ 2025-03-07  6:34     ` Ojaswin Mujoo
  2025-03-07  8:13       ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-07  6:34 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > stress tests, it was observed that in case of some error handling paths, it is
> > possible for update_super_work to start a transaction after the journal is
> > destroyed eg:
> > 
> > (umount)
> > ext4_kill_sb
> >   kill_block_super
> >     generic_shutdown_super
> >       sync_filesystem /* commits all txns */
> >       evict_inodes
> >         /* might start a new txn */
> >       ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >         jbd2_journal_destroy
> >           journal_kill_thread
> >             journal->j_flags |= JBD2_UNMOUNT;
> >           jbd2_journal_commit_transaction
> >             jbd2_journal_get_descriptor_buffer
> >               jbd2_journal_bmap
> >                 ext4_journal_bmap
> >                   ext4_map_blocks
> >                     ...
> >                     ext4_inode_error
> >                       ext4_handle_error
> >                         schedule_work(&sbi->s_sb_upd_work)
> > 
> >                                                /* work queue kicks in */
> >                                                update_super_work
> >                                                  jbd2_journal_start
> >                                                    start_this_handle
> >                                                      BUG_ON(journal->j_flags &
> >                                                             JBD2_UNMOUNT)
> > 
> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > destroying only do a journaled (and deferred) update of sb if this flag is not
> > set. Otherwise, just fallback to an un-journaled commit.
> > 
> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > during ext4_put_super() (except a running transaction that will get commited
> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > outside the journal as it won't race with a journaled update (refer
> > 2d01ddc86606).
> > 
> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > called from mballoc and AFAICT it would be always valid to schedule work here.
> > 
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >  fs/ext4/ext4.h      | 2 ++
> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >  fs/ext4/super.c     | 4 +++-
> >  3 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 2b7d781bfcad..d48e93bd5690 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >  	 */
> >  	struct work_struct s_sb_upd_work;
> >  
> > +	bool s_journal_destorying;
> > +
> >  	/* Atomic write unit values in bytes */
> >  	unsigned int s_awu_min;
> >  	unsigned int s_awu_max;
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 9b3c9df02a39..6bd3ca84410d 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >  {
> >  	int err = 0;
> >  
> > +	/*
> > +	 * At this point all pending FS updates should be done except a possible
> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > +	 * is now safe for any new errors to directly commit superblock rather
> > +	 * than going via journal.
> > +	 */
> > +	sbi->s_journal_destorying = true;
> > +
> 
> Hi, Ojaswin!
> 
> I'm afraid you still need to flush the superblock update work here,
> otherwise I guess the race condition you mentioned in v1 could still
> occur.
> 
>  ext4_put_super()
>   flush_work(&sbi->s_sb_upd_work)
> 
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                     ext4_inode_error()
>                       /* JBD2_UNMOUNT not set */
>                       schedule_work(s_sb_upd_work)
> 
>                                   **workqueue**
>                                    update_super_work
>                                    /* s_journal_destorying is not set */
>                             	   if (journal && !s_journal_destorying)
> 
>   ext4_journal_destroy()
>    /* set s_journal_destorying */
>    sbi->s_journal_destorying = true;
>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
> 
>                                        jbd2_journal_start()
>                                         start_this_handle()
>                                           BUG_ON(JBD2_UNMOUNT)
> 
> Thanks,
> Yi.
Hi Yi,

Yes you are right, somehow missed this edge case :(

Alright then, we have to move out sbi->s_journal_destroying outside the
helper. Just wondering if I should still let it be in
ext4_journal_destroy and just add an extra s_journal_destroying = false
before schedule_work(s_sb_upd_work), because it makes sense.

Okay let me give it some thought but thanks for pointing this out!

Regards,
ojaswin

> 
> >  	err = jbd2_journal_destroy(journal);
> >  	sbi->s_journal = NULL;
> >  
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 8ad664d47806..31552cf0519a 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >  		 * constraints, it may not be safe to do it right here so we
> >  		 * defer superblock flushing to a workqueue.
> >  		 */
> > -		if (continue_fs && journal)
> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >  		else
> >  			ext4_commit_super(sb);
> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >  	spin_lock_init(&sbi->s_error_lock);
> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >  
> > +	sbi->s_journal_destorying = false;
> > +
> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >  	if (err)
> >  		goto failed_mount3;
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07  6:34     ` Ojaswin Mujoo
@ 2025-03-07  8:13       ` Ojaswin Mujoo
  2025-03-07  8:43         ` Zhang Yi
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-07  8:13 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > > stress tests, it was observed that in case of some error handling paths, it is
> > > possible for update_super_work to start a transaction after the journal is
> > > destroyed eg:
> > > 
> > > (umount)
> > > ext4_kill_sb
> > >   kill_block_super
> > >     generic_shutdown_super
> > >       sync_filesystem /* commits all txns */
> > >       evict_inodes
> > >         /* might start a new txn */
> > >       ext4_put_super
> > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >         jbd2_journal_destroy
> > >           journal_kill_thread
> > >             journal->j_flags |= JBD2_UNMOUNT;
> > >           jbd2_journal_commit_transaction
> > >             jbd2_journal_get_descriptor_buffer
> > >               jbd2_journal_bmap
> > >                 ext4_journal_bmap
> > >                   ext4_map_blocks
> > >                     ...
> > >                     ext4_inode_error
> > >                       ext4_handle_error
> > >                         schedule_work(&sbi->s_sb_upd_work)
> > > 
> > >                                                /* work queue kicks in */
> > >                                                update_super_work
> > >                                                  jbd2_journal_start
> > >                                                    start_this_handle
> > >                                                      BUG_ON(journal->j_flags &
> > >                                                             JBD2_UNMOUNT)
> > > 
> > > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > > destroying only do a journaled (and deferred) update of sb if this flag is not
> > > set. Otherwise, just fallback to an un-journaled commit.
> > > 
> > > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > > during ext4_put_super() (except a running transaction that will get commited
> > > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > > outside the journal as it won't race with a journaled update (refer
> > > 2d01ddc86606).
> > > 
> > > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > > called from mballoc and AFAICT it would be always valid to schedule work here.
> > > 
> > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > Suggested-by: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > ---
> > >  fs/ext4/ext4.h      | 2 ++
> > >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > >  fs/ext4/super.c     | 4 +++-
> > >  3 files changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 2b7d781bfcad..d48e93bd5690 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > >  	 */
> > >  	struct work_struct s_sb_upd_work;
> > >  
> > > +	bool s_journal_destorying;
> > > +
> > >  	/* Atomic write unit values in bytes */
> > >  	unsigned int s_awu_min;
> > >  	unsigned int s_awu_max;
> > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > index 9b3c9df02a39..6bd3ca84410d 100644
> > > --- a/fs/ext4/ext4_jbd2.h
> > > +++ b/fs/ext4/ext4_jbd2.h
> > > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > >  {
> > >  	int err = 0;
> > >  
> > > +	/*
> > > +	 * At this point all pending FS updates should be done except a possible
> > > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > > +	 * is now safe for any new errors to directly commit superblock rather
> > > +	 * than going via journal.
> > > +	 */
> > > +	sbi->s_journal_destorying = true;
> > > +
> > 
> > Hi, Ojaswin!
> > 
> > I'm afraid you still need to flush the superblock update work here,
> > otherwise I guess the race condition you mentioned in v1 could still
> > occur.
> > 
> >  ext4_put_super()
> >   flush_work(&sbi->s_sb_upd_work)
> > 
> >                     **kjournald2**
> >                     jbd2_journal_commit_transaction()
> >                     ...
> >                     ext4_inode_error()
> >                       /* JBD2_UNMOUNT not set */
> >                       schedule_work(s_sb_upd_work)
> > 
> >                                   **workqueue**
> >                                    update_super_work
> >                                    /* s_journal_destorying is not set */
> >                             	   if (journal && !s_journal_destorying)
> > 
> >   ext4_journal_destroy()
> >    /* set s_journal_destorying */
> >    sbi->s_journal_destorying = true;
> >    jbd2_journal_destroy()
> >     journal->j_flags |= JBD2_UNMOUNT;
> > 
> >                                        jbd2_journal_start()
> >                                         start_this_handle()
> >                                           BUG_ON(JBD2_UNMOUNT)
> > 
> > Thanks,
> > Yi.
> Hi Yi,
> 
> Yes you are right, somehow missed this edge case :(
> 
> Alright then, we have to move out sbi->s_journal_destroying outside the
> helper. Just wondering if I should still let it be in
> ext4_journal_destroy and just add an extra s_journal_destroying = false
> before schedule_work(s_sb_upd_work), because it makes sense.
> 
> Okay let me give it some thought but thanks for pointing this out!
> 
> Regards,
> ojaswin

Okay so thinking about it a bit more, I see you also suggested to flush
the work after marking sbi->s_journal_destroying. But will that solve
it?

  ext4_put_super()
   flush_work(&sbi->s_sb_upd_work)
 
                     **kjournald2**
                     jbd2_journal_commit_transaction()
                     ...
                     ext4_inode_error()
                       /* JBD2_UNMOUNT not set */
                       schedule_work(s_sb_upd_work)
 
                                    **workqueue**
                                    update_super_work
                                    /* s_journal_destorying is not set */
                             	      if (journal && !s_journal_destorying)
 
   ext4_journal_destroy()
    /* set s_journal_destorying */
    sbi->s_journal_destorying = true;
    flush_work(&sbi->s_sb_upd_work)
                                      schedule_work()
    jbd2_journal_destroy()
     journal->j_flags |= JBD2_UNMOUNT;
 
                                        jbd2_journal_start()
                                         start_this_handle()
                                           BUG_ON(JBD2_UNMOUNT)


Seems like these edge cases keep sprouting up :)

As for the fix, how about we do something like this:

  ext4_put_super()

   flush_work(&sbi->s_sb_upd_work)
   destroy_workqueue(sbi->rsv_conversion_wq);

   ext4_journal_destroy()
    /* set s_journal_destorying */
    sbi->s_journal_destorying = true;

   /* trigger a commit and wait for it to complete */

    flush_work(&sbi->s_sb_upd_work)

    jbd2_journal_destroy()
     journal->j_flags |= JBD2_UNMOUNT;
 
                                        jbd2_journal_start()
                                         start_this_handle()
                                           BUG_ON(JBD2_UNMOUNT)

Still giving this codepath some thought but seems like this might just
be enough to fix the race. Thoughts on this?

Regards,
ojaswin

> > 
> > >  	err = jbd2_journal_destroy(journal);
> > >  	sbi->s_journal = NULL;
> > >  
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 8ad664d47806..31552cf0519a 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> > >  		 * constraints, it may not be safe to do it right here so we
> > >  		 * defer superblock flushing to a workqueue.
> > >  		 */
> > > -		if (continue_fs && journal)
> > > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> > >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> > >  		else
> > >  			ext4_commit_super(sb);
> > > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> > >  	spin_lock_init(&sbi->s_error_lock);
> > >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> > >  
> > > +	sbi->s_journal_destorying = false;
> > > +
> > >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> > >  	if (err)
> > >  		goto failed_mount3;
> > 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07  8:13       ` Ojaswin Mujoo
@ 2025-03-07  8:43         ` Zhang Yi
  2025-03-07 10:27           ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Zhang Yi @ 2025-03-07  8:43 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o



On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>> possible for update_super_work to start a transaction after the journal is
>>>> destroyed eg:
>>>>
>>>> (umount)
>>>> ext4_kill_sb
>>>>   kill_block_super
>>>>     generic_shutdown_super
>>>>       sync_filesystem /* commits all txns */
>>>>       evict_inodes
>>>>         /* might start a new txn */
>>>>       ext4_put_super
>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>         jbd2_journal_destroy
>>>>           journal_kill_thread
>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>           jbd2_journal_commit_transaction
>>>>             jbd2_journal_get_descriptor_buffer
>>>>               jbd2_journal_bmap
>>>>                 ext4_journal_bmap
>>>>                   ext4_map_blocks
>>>>                     ...
>>>>                     ext4_inode_error
>>>>                       ext4_handle_error
>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>
>>>>                                                /* work queue kicks in */
>>>>                                                update_super_work
>>>>                                                  jbd2_journal_start
>>>>                                                    start_this_handle
>>>>                                                      BUG_ON(journal->j_flags &
>>>>                                                             JBD2_UNMOUNT)
>>>>
>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>
>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>> during ext4_put_super() (except a running transaction that will get commited
>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>> outside the journal as it won't race with a journaled update (refer
>>>> 2d01ddc86606).
>>>>
>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>
>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>> ---
>>>>  fs/ext4/ext4.h      | 2 ++
>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>  fs/ext4/super.c     | 4 +++-
>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>  	 */
>>>>  	struct work_struct s_sb_upd_work;
>>>>  
>>>> +	bool s_journal_destorying;
>>>> +
>>>>  	/* Atomic write unit values in bytes */
>>>>  	unsigned int s_awu_min;
>>>>  	unsigned int s_awu_max;
>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>> --- a/fs/ext4/ext4_jbd2.h
>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>  {
>>>>  	int err = 0;
>>>>  
>>>> +	/*
>>>> +	 * At this point all pending FS updates should be done except a possible
>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>> +	 * than going via journal.
>>>> +	 */
>>>> +	sbi->s_journal_destorying = true;
>>>> +
>>>
>>> Hi, Ojaswin!
>>>
>>> I'm afraid you still need to flush the superblock update work here,
>>> otherwise I guess the race condition you mentioned in v1 could still
>>> occur.
>>>
>>>  ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                     **kjournald2**
>>>                     jbd2_journal_commit_transaction()
>>>                     ...
>>>                     ext4_inode_error()
>>>                       /* JBD2_UNMOUNT not set */
>>>                       schedule_work(s_sb_upd_work)
>>>
>>>                                   **workqueue**
>>>                                    update_super_work
>>>                                    /* s_journal_destorying is not set */
>>>                             	   if (journal && !s_journal_destorying)
>>>
>>>   ext4_journal_destroy()
>>>    /* set s_journal_destorying */
>>>    sbi->s_journal_destorying = true;
>>>    jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                                        jbd2_journal_start()
>>>                                         start_this_handle()
>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>
>>> Thanks,
>>> Yi.
>> Hi Yi,
>>
>> Yes you are right, somehow missed this edge case :(
>>
>> Alright then, we have to move out sbi->s_journal_destroying outside the
>> helper. Just wondering if I should still let it be in
>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>> before schedule_work(s_sb_upd_work), because it makes sense.
>>
>> Okay let me give it some thought but thanks for pointing this out!
>>
>> Regards,
>> ojaswin
> 
> Okay so thinking about it a bit more, I see you also suggested to flush
> the work after marking sbi->s_journal_destroying. But will that solve
> it?
> 
>   ext4_put_super()
>    flush_work(&sbi->s_sb_upd_work)
>  
>                      **kjournald2**
>                      jbd2_journal_commit_transaction()
>                      ...
>                      ext4_inode_error()
>                        /* JBD2_UNMOUNT not set */
>                        schedule_work(s_sb_upd_work)
>  
>                                     **workqueue**
>                                     update_super_work
>                                     /* s_journal_destorying is not set */
>                              	      if (journal && !s_journal_destorying)
>  
>    ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
>     flush_work(&sbi->s_sb_upd_work)
>                                       schedule_work()
                                        ^^^^^^^^^^^^^^^
                                        where does this come from?

After this flush_work, we can guarantee that the running s_sb_upd_work
finishes before we set JBD2_UNMOUNT. Additionally, the journal will
not commit transaction or call schedule_work() again because it has
been aborted due to the previous error. Am I missing something?

Thanks,
Yi.

>     jbd2_journal_destroy()
>      journal->j_flags |= JBD2_UNMOUNT;
>  
>                                         jbd2_journal_start()
>                                          start_this_handle()
>                                            BUG_ON(JBD2_UNMOUNT)
> 
> 
> Seems like these edge cases keep sprouting up :)
> 
> As for the fix, how about we do something like this:
> 
>   ext4_put_super()
> 
>    flush_work(&sbi->s_sb_upd_work)
>    destroy_workqueue(sbi->rsv_conversion_wq);
> 
>    ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
> 
>    /* trigger a commit and wait for it to complete */
> 
>     flush_work(&sbi->s_sb_upd_work)
> 
>     jbd2_journal_destroy()
>      journal->j_flags |= JBD2_UNMOUNT;
>  
>                                         jbd2_journal_start()
>                                          start_this_handle()
>                                            BUG_ON(JBD2_UNMOUNT)
> 
> Still giving this codepath some thought but seems like this might just
> be enough to fix the race. Thoughts on this?
> 
> Regards,
> ojaswin
> 
>>>
>>>>  	err = jbd2_journal_destroy(journal);
>>>>  	sbi->s_journal = NULL;
>>>>  
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 8ad664d47806..31552cf0519a 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>>  		 * constraints, it may not be safe to do it right here so we
>>>>  		 * defer superblock flushing to a workqueue.
>>>>  		 */
>>>> -		if (continue_fs && journal)
>>>> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>>>>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>>  		else
>>>>  			ext4_commit_super(sb);
>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>>>>  	spin_lock_init(&sbi->s_error_lock);
>>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>>>>  
>>>> +	sbi->s_journal_destorying = false;
>>>> +
>>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>>>>  	if (err)
>>>>  		goto failed_mount3;
>>>
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07  8:43         ` Zhang Yi
@ 2025-03-07 10:27           ` Ojaswin Mujoo
  2025-03-07 12:36             ` Zhang Yi
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-07 10:27 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> 
> 
> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> >> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> >>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> >>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> >>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >>>> stress tests, it was observed that in case of some error handling paths, it is
> >>>> possible for update_super_work to start a transaction after the journal is
> >>>> destroyed eg:
> >>>>
> >>>> (umount)
> >>>> ext4_kill_sb
> >>>>   kill_block_super
> >>>>     generic_shutdown_super
> >>>>       sync_filesystem /* commits all txns */
> >>>>       evict_inodes
> >>>>         /* might start a new txn */
> >>>>       ext4_put_super
> >>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >>>>         jbd2_journal_destroy
> >>>>           journal_kill_thread
> >>>>             journal->j_flags |= JBD2_UNMOUNT;
> >>>>           jbd2_journal_commit_transaction
> >>>>             jbd2_journal_get_descriptor_buffer
> >>>>               jbd2_journal_bmap
> >>>>                 ext4_journal_bmap
> >>>>                   ext4_map_blocks
> >>>>                     ...
> >>>>                     ext4_inode_error
> >>>>                       ext4_handle_error
> >>>>                         schedule_work(&sbi->s_sb_upd_work)
> >>>>
> >>>>                                                /* work queue kicks in */
> >>>>                                                update_super_work
> >>>>                                                  jbd2_journal_start
> >>>>                                                    start_this_handle
> >>>>                                                      BUG_ON(journal->j_flags &
> >>>>                                                             JBD2_UNMOUNT)
> >>>>
> >>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> >>>> set. Otherwise, just fallback to an un-journaled commit.
> >>>>
> >>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> >>>> during ext4_put_super() (except a running transaction that will get commited
> >>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >>>> outside the journal as it won't race with a journaled update (refer
> >>>> 2d01ddc86606).
> >>>>
> >>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> >>>>
> >>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >>>> Suggested-by: Jan Kara <jack@suse.cz>
> >>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >>>> ---
> >>>>  fs/ext4/ext4.h      | 2 ++
> >>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >>>>  fs/ext4/super.c     | 4 +++-
> >>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>> index 2b7d781bfcad..d48e93bd5690 100644
> >>>> --- a/fs/ext4/ext4.h
> >>>> +++ b/fs/ext4/ext4.h
> >>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >>>>  	 */
> >>>>  	struct work_struct s_sb_upd_work;
> >>>>  
> >>>> +	bool s_journal_destorying;
> >>>> +
> >>>>  	/* Atomic write unit values in bytes */
> >>>>  	unsigned int s_awu_min;
> >>>>  	unsigned int s_awu_max;
> >>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>>> index 9b3c9df02a39..6bd3ca84410d 100644
> >>>> --- a/fs/ext4/ext4_jbd2.h
> >>>> +++ b/fs/ext4/ext4_jbd2.h
> >>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >>>>  {
> >>>>  	int err = 0;
> >>>>  
> >>>> +	/*
> >>>> +	 * At this point all pending FS updates should be done except a possible
> >>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >>>> +	 * is now safe for any new errors to directly commit superblock rather
> >>>> +	 * than going via journal.
> >>>> +	 */
> >>>> +	sbi->s_journal_destorying = true;
> >>>> +
> >>>
> >>> Hi, Ojaswin!
> >>>
> >>> I'm afraid you still need to flush the superblock update work here,
> >>> otherwise I guess the race condition you mentioned in v1 could still
> >>> occur.
> >>>
> >>>  ext4_put_super()
> >>>   flush_work(&sbi->s_sb_upd_work)
> >>>
> >>>                     **kjournald2**
> >>>                     jbd2_journal_commit_transaction()
> >>>                     ...
> >>>                     ext4_inode_error()
> >>>                       /* JBD2_UNMOUNT not set */
> >>>                       schedule_work(s_sb_upd_work)
> >>>
> >>>                                   **workqueue**
> >>>                                    update_super_work
> >>>                                    /* s_journal_destorying is not set */
> >>>                             	   if (journal && !s_journal_destorying)
> >>>
> >>>   ext4_journal_destroy()
> >>>    /* set s_journal_destorying */
> >>>    sbi->s_journal_destorying = true;
> >>>    jbd2_journal_destroy()
> >>>     journal->j_flags |= JBD2_UNMOUNT;
> >>>
> >>>                                        jbd2_journal_start()
> >>>                                         start_this_handle()
> >>>                                           BUG_ON(JBD2_UNMOUNT)
> >>>
> >>> Thanks,
> >>> Yi.
> >> Hi Yi,
> >>
> >> Yes you are right, somehow missed this edge case :(
> >>
> >> Alright then, we have to move out sbi->s_journal_destroying outside the
> >> helper. Just wondering if I should still let it be in
> >> ext4_journal_destroy and just add an extra s_journal_destroying = false
> >> before schedule_work(s_sb_upd_work), because it makes sense.
> >>
> >> Okay let me give it some thought but thanks for pointing this out!
> >>
> >> Regards,
> >> ojaswin
> > 
> > Okay so thinking about it a bit more, I see you also suggested to flush
> > the work after marking sbi->s_journal_destroying. But will that solve
> > it?
> > 
> >   ext4_put_super()
> >    flush_work(&sbi->s_sb_upd_work)
> >  
> >                      **kjournald2**
> >                      jbd2_journal_commit_transaction()
> >                      ...
> >                      ext4_inode_error()
> >                        /* JBD2_UNMOUNT not set */
> >                        schedule_work(s_sb_upd_work)
> >  
> >                                     **workqueue**
> >                                     update_super_work
> >                                     /* s_journal_destorying is not set */
> >                              	      if (journal && !s_journal_destorying)
> >  
> >    ext4_journal_destroy()
> >     /* set s_journal_destorying */
> >     sbi->s_journal_destorying = true;
> >     flush_work(&sbi->s_sb_upd_work)
> >                                       schedule_work()
>                                         ^^^^^^^^^^^^^^^
>                                         where does this come from?
> 
> After this flush_work, we can guarantee that the running s_sb_upd_work
> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> not commit transaction or call schedule_work() again because it has
> been aborted due to the previous error. Am I missing something?
> 
> Thanks,
> Yi.

Hmm, so I am thinking of a corner case in ext4_handle_error() where 

 if(journal && !is_journal_destroying) 

is computed but schedule_work() not called yet, which is possible cause
the cmp followed by jump is not atomic in nature. If the schedule_work
is only called after we have done the flush then we end up with this:

                              	      if (journal && !s_journal_destorying)
    ext4_journal_destroy()
     /* set s_journal_destorying */
     sbi->s_journal_destorying = true;
     flush_work(&sbi->s_sb_upd_work)
                                       schedule_work()

Which is possible IMO, although the window is tiny.

Regards,
ojaswin

> 
> >     jbd2_journal_destroy()
> >      journal->j_flags |= JBD2_UNMOUNT;
> >  
> >                                         jbd2_journal_start()
> >                                          start_this_handle()
> >                                            BUG_ON(JBD2_UNMOUNT)
> > 
> > 
> > Seems like these edge cases keep sprouting up :)
> > 
> > As for the fix, how about we do something like this:
> > 
> >   ext4_put_super()
> > 
> >    flush_work(&sbi->s_sb_upd_work)
> >    destroy_workqueue(sbi->rsv_conversion_wq);
> > 
> >    ext4_journal_destroy()
> >     /* set s_journal_destorying */
> >     sbi->s_journal_destorying = true;
> > 
> >    /* trigger a commit and wait for it to complete */
> > 
> >     flush_work(&sbi->s_sb_upd_work)
> > 
> >     jbd2_journal_destroy()
> >      journal->j_flags |= JBD2_UNMOUNT;
> >  
> >                                         jbd2_journal_start()
> >                                          start_this_handle()
> >                                            BUG_ON(JBD2_UNMOUNT)
> > 
> > Still giving this codepath some thought but seems like this might just
> > be enough to fix the race. Thoughts on this?
> > 
> > Regards,
> > ojaswin
> > 
> >>>
> >>>>  	err = jbd2_journal_destroy(journal);
> >>>>  	sbi->s_journal = NULL;
> >>>>  
> >>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >>>> index 8ad664d47806..31552cf0519a 100644
> >>>> --- a/fs/ext4/super.c
> >>>> +++ b/fs/ext4/super.c
> >>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >>>>  		 * constraints, it may not be safe to do it right here so we
> >>>>  		 * defer superblock flushing to a workqueue.
> >>>>  		 */
> >>>> -		if (continue_fs && journal)
> >>>> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >>>>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >>>>  		else
> >>>>  			ext4_commit_super(sb);
> >>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >>>>  	spin_lock_init(&sbi->s_error_lock);
> >>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >>>>  
> >>>> +	sbi->s_journal_destorying = false;
> >>>> +
> >>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >>>>  	if (err)
> >>>>  		goto failed_mount3;
> >>>
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07 10:27           ` Ojaswin Mujoo
@ 2025-03-07 12:36             ` Zhang Yi
  2025-03-07 17:26               ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Zhang Yi @ 2025-03-07 12:36 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>> destroyed eg:
>>>>>>
>>>>>> (umount)
>>>>>> ext4_kill_sb
>>>>>>   kill_block_super
>>>>>>     generic_shutdown_super
>>>>>>       sync_filesystem /* commits all txns */
>>>>>>       evict_inodes
>>>>>>         /* might start a new txn */
>>>>>>       ext4_put_super
>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>         jbd2_journal_destroy
>>>>>>           journal_kill_thread
>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>           jbd2_journal_commit_transaction
>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>               jbd2_journal_bmap
>>>>>>                 ext4_journal_bmap
>>>>>>                   ext4_map_blocks
>>>>>>                     ...
>>>>>>                     ext4_inode_error
>>>>>>                       ext4_handle_error
>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>                                                /* work queue kicks in */
>>>>>>                                                update_super_work
>>>>>>                                                  jbd2_journal_start
>>>>>>                                                    start_this_handle
>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>
>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>
>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>> 2d01ddc86606).
>>>>>>
>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>
>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>> ---
>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>> --- a/fs/ext4/ext4.h
>>>>>> +++ b/fs/ext4/ext4.h
>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>  	 */
>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>  
>>>>>> +	bool s_journal_destorying;
>>>>>> +
>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>  	unsigned int s_awu_min;
>>>>>>  	unsigned int s_awu_max;
>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>  {
>>>>>>  	int err = 0;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>> +	 * than going via journal.
>>>>>> +	 */
>>>>>> +	sbi->s_journal_destorying = true;
>>>>>> +
>>>>>
>>>>> Hi, Ojaswin!
>>>>>
>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>> occur.
>>>>>
>>>>>  ext4_put_super()
>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>>                     **kjournald2**
>>>>>                     jbd2_journal_commit_transaction()
>>>>>                     ...
>>>>>                     ext4_inode_error()
>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>                       schedule_work(s_sb_upd_work)
>>>>>
>>>>>                                   **workqueue**
>>>>>                                    update_super_work
>>>>>                                    /* s_journal_destorying is not set */
>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>
>>>>>   ext4_journal_destroy()
>>>>>    /* set s_journal_destorying */
>>>>>    sbi->s_journal_destorying = true;
>>>>>    jbd2_journal_destroy()
>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>
>>>>>                                        jbd2_journal_start()
>>>>>                                         start_this_handle()
>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>> Hi Yi,
>>>>
>>>> Yes you are right, somehow missed this edge case :(
>>>>
>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>> helper. Just wondering if I should still let it be in
>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>
>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>
>>>> Regards,
>>>> ojaswin
>>>
>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>> the work after marking sbi->s_journal_destroying. But will that solve
>>> it?
>>>
>>>   ext4_put_super()
>>>    flush_work(&sbi->s_sb_upd_work)
>>>  
>>>                      **kjournald2**
>>>                      jbd2_journal_commit_transaction()
>>>                      ...
>>>                      ext4_inode_error()
>>>                        /* JBD2_UNMOUNT not set */
>>>                        schedule_work(s_sb_upd_work)
>>>  
>>>                                     **workqueue**
>>>                                     update_super_work
>>>                                     /* s_journal_destorying is not set */
>>>                              	      if (journal && !s_journal_destorying)
>>>  
>>>    ext4_journal_destroy()
>>>     /* set s_journal_destorying */
>>>     sbi->s_journal_destorying = true;
>>>     flush_work(&sbi->s_sb_upd_work)
>>>                                       schedule_work()
>>                                         ^^^^^^^^^^^^^^^
>>                                         where does this come from?
>>
>> After this flush_work, we can guarantee that the running s_sb_upd_work
>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>> not commit transaction or call schedule_work() again because it has
>> been aborted due to the previous error. Am I missing something?
>>
>> Thanks,
>> Yi.
> 
> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
> 
>  if(journal && !is_journal_destroying) 
> 
> is computed but schedule_work() not called yet, which is possible cause
> the cmp followed by jump is not atomic in nature. If the schedule_work
> is only called after we have done the flush then we end up with this:
> 
>                               	      if (journal && !s_journal_destorying)
>     ext4_journal_destroy()
>      /* set s_journal_destorying */
>      sbi->s_journal_destorying = true;
>      flush_work(&sbi->s_sb_upd_work)
>                                        schedule_work()
> 
> Which is possible IMO, although the window is tiny.

Yeah, right!
Sorry for misread the location where you add the "!s_journal_destorying"
check, the graph I provided was in update_super_work(), which was wrong.
The right one should be:

 ext4_put_super()
  flush_work(&sbi->s_sb_upd_work)

                    **kjournald2**
                    jbd2_journal_commit_transaction()
                    ...
                    ext4_inode_error()
                      /* s_journal_destorying is not set */
                      if (journal && !s_journal_destorying)
                        (schedule_work(s_sb_upd_work))  //can be here

  ext4_journal_destroy()
   /* set s_journal_destorying */
   sbi->s_journal_destorying = true;
   jbd2_journal_destroy()
    journal->j_flags |= JBD2_UNMOUNT;

                        (schedule_work(s_sb_upd_work))  //also can be here

                                  **workqueue**
                                   update_super_work()
                                   journal = sbi->s_journal //get journal
    kfree(journal)
                                     jbd2_journal_start(journal) //journal UAF
                                       start_this_handle()
                                         BUG_ON(JBD2_UNMOUNT) //bugon here


So there are two problems here, the first one is the 'journal' UAF,
the second one is triggering JBD2_UNMOUNT flag BUGON.

>>>
>>> As for the fix, how about we do something like this:
>>>
>>>   ext4_put_super()
>>>
>>>    flush_work(&sbi->s_sb_upd_work)
>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>
>>>    ext4_journal_destroy()
>>>     /* set s_journal_destorying */
>>>     sbi->s_journal_destorying = true;
>>>
>>>    /* trigger a commit and wait for it to complete */
>>>
>>>     flush_work(&sbi->s_sb_upd_work)
>>>
>>>     jbd2_journal_destroy()
>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>  
>>>                                         jbd2_journal_start()
>>>                                          start_this_handle()
>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>
>>> Still giving this codepath some thought but seems like this might just
>>> be enough to fix the race. Thoughts on this?
>>>

I think this solution should work, the forced commit and flush_work()
should ensure that the last transaction is committed and that the
potential work is done.

Besides, the s_journal_destorying flag is set and check concurrently
now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
about adding a new flag into sbi->s_mount_state instead of adding
new s_journal_destorying?

Thanks,
Yi.

>>>
>>>>>
>>>>>>  	err = jbd2_journal_destroy(journal);
>>>>>>  	sbi->s_journal = NULL;
>>>>>>  
>>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>>>> index 8ad664d47806..31552cf0519a 100644
>>>>>> --- a/fs/ext4/super.c
>>>>>> +++ b/fs/ext4/super.c
>>>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>>>>  		 * constraints, it may not be safe to do it right here so we
>>>>>>  		 * defer superblock flushing to a workqueue.
>>>>>>  		 */
>>>>>> -		if (continue_fs && journal)
>>>>>> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>>>>>>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>>>>  		else
>>>>>>  			ext4_commit_super(sb);
>>>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>>>>>>  	spin_lock_init(&sbi->s_error_lock);
>>>>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>>>>>>  
>>>>>> +	sbi->s_journal_destorying = false;
>>>>>> +
>>>>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>>>>>>  	if (err)
>>>>>>  		goto failed_mount3;
>>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper
  2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
@ 2025-03-07 14:05   ` Jan Kara
  2025-03-08  1:18   ` Baokun Li
  1 sibling, 0 replies; 41+ messages in thread
From: Jan Kara @ 2025-03-07 14:05 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel

On Thu 06-03-25 19:58:32, Ojaswin Mujoo wrote:
> Define an ext4 wrapper over jbd2_journal_destroy to make sure we
> have consistent behavior during journal destruction. This will also
> come useful in the next patch where we add some ext4 specific logic
> in the destroy path.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4_jbd2.h | 14 ++++++++++++++
>  fs/ext4/super.c     | 16 ++++++----------
>  2 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 3f2596c9e5f2..9b3c9df02a39 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -429,4 +429,18 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>  	return 1;
>  }
>  
> +/*
> + * Pass journal explicitly as it may not be cached in the sbi->s_journal in some
> + * cases
> + */
> +static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
> +{
> +	int err = 0;
> +
> +	err = jbd2_journal_destroy(journal);
> +	sbi->s_journal = NULL;
> +
> +	return err;
> +}
> +
>  #endif	/* _EXT4_JBD2_H */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a963ffda692a..8ad664d47806 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1297,8 +1297,7 @@ static void ext4_put_super(struct super_block *sb)
>  
>  	if (sbi->s_journal) {
>  		aborted = is_journal_aborted(sbi->s_journal);
> -		err = jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		err = ext4_journal_destroy(sbi, sbi->s_journal);
>  		if ((err < 0) && !aborted) {
>  			ext4_abort(sb, -err, "Couldn't clean up the journal");
>  		}
> @@ -4960,8 +4959,7 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>  out:
>  	/* flush s_sb_upd_work before destroying the journal. */
>  	flush_work(&sbi->s_sb_upd_work);
> -	jbd2_journal_destroy(sbi->s_journal);
> -	sbi->s_journal = NULL;
> +	ext4_journal_destroy(sbi, sbi->s_journal);
>  	return -EINVAL;
>  }
>  
> @@ -5652,8 +5650,7 @@ failed_mount8: __maybe_unused
>  	if (sbi->s_journal) {
>  		/* flush s_sb_upd_work before journal destroy. */
>  		flush_work(&sbi->s_sb_upd_work);
> -		jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		ext4_journal_destroy(sbi, sbi->s_journal);
>  	}
>  failed_mount3a:
>  	ext4_es_unregister_shrinker(sbi);
> @@ -5958,7 +5955,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
>  	return journal;
>  
>  out_journal:
> -	jbd2_journal_destroy(journal);
> +	ext4_journal_destroy(EXT4_SB(sb), journal);
>  out_bdev:
>  	bdev_fput(bdev_file);
>  	return ERR_PTR(errno);
> @@ -6075,8 +6072,7 @@ static int ext4_load_journal(struct super_block *sb,
>  	EXT4_SB(sb)->s_journal = journal;
>  	err = ext4_clear_journal_err(sb, es);
>  	if (err) {
> -		EXT4_SB(sb)->s_journal = NULL;
> -		jbd2_journal_destroy(journal);
> +		ext4_journal_destroy(EXT4_SB(sb), journal);
>  		return err;
>  	}
>  
> @@ -6094,7 +6090,7 @@ static int ext4_load_journal(struct super_block *sb,
>  	return 0;
>  
>  err_out:
> -	jbd2_journal_destroy(journal);
> +	ext4_journal_destroy(EXT4_SB(sb), journal);
>  	return err;
>  }
>  
> -- 
> 2.48.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
  2025-03-07  2:49   ` Zhang Yi
@ 2025-03-07 14:26   ` Jan Kara
  2025-03-07 17:00     ` Ojaswin Mujoo
  2025-03-08  9:55   ` Ritesh Harjani (IBM)
  2025-03-12 13:57   ` Eric Sandeen
  3 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2025-03-07 14:26 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

On Thu 06-03-25 19:58:33, Ojaswin Mujoo wrote:
> Presently we always BUG_ON if trying to start a transaction on a journal marked
> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> stress tests, it was observed that in case of some error handling paths, it is
> possible for update_super_work to start a transaction after the journal is
> destroyed eg:
> 
> (umount)
> ext4_kill_sb
>   kill_block_super
>     generic_shutdown_super
>       sync_filesystem /* commits all txns */
>       evict_inodes
>         /* might start a new txn */
>       ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>         jbd2_journal_destroy
>           journal_kill_thread
>             journal->j_flags |= JBD2_UNMOUNT;
>           jbd2_journal_commit_transaction
>             jbd2_journal_get_descriptor_buffer
>               jbd2_journal_bmap
>                 ext4_journal_bmap
>                   ext4_map_blocks
>                     ...
>                     ext4_inode_error
>                       ext4_handle_error
>                         schedule_work(&sbi->s_sb_upd_work)
> 
>                                                /* work queue kicks in */
>                                                update_super_work
>                                                  jbd2_journal_start
>                                                    start_this_handle
>                                                      BUG_ON(journal->j_flags &
>                                                             JBD2_UNMOUNT)
> 
> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> destroying only do a journaled (and deferred) update of sb if this flag is not
> set. Otherwise, just fallback to an un-journaled commit.
> 
> We set sbi->s_journal_destroying = true only after all the FS updates are done
> during ext4_put_super() (except a running transaction that will get commited
> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> outside the journal as it won't race with a journaled update (refer
> 2d01ddc86606).
> 
> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> called from mballoc and AFAICT it would be always valid to schedule work here.
> 
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/ext4/ext4.h      | 2 ++
>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>  fs/ext4/super.c     | 4 +++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b7d781bfcad..d48e93bd5690 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>  	 */
>  	struct work_struct s_sb_upd_work;
>  
> +	bool s_journal_destorying;
> +

Not that it would matter much but why not make this a flag in
sbi->s_mount_flags?

> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 9b3c9df02a39..6bd3ca84410d 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>  {
>  	int err = 0;
>  
> +	/*
> +	 * At this point all pending FS updates should be done except a possible
> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> +	 * is now safe for any new errors to directly commit superblock rather
> +	 * than going via journal.
> +	 */
> +	sbi->s_journal_destorying = true;
> +

So as you already uncovered with Zhang Yi, this does not work. What I meant
was that we move flush_work(&sbi->s_sb_upd_work) into
ext4_journal_destroy() and set s_journal_destorying *before* calling
flush_work(). By the time ext4_journal_destroy() gets called, the
filesystem is quiescent, there cannot be new handles started (except for sb
update itself from the workqueue) and thus if we hit some error, the
journal will be aborted anyway and in that case non-journaled sb update is
safe.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07 14:26   ` Jan Kara
@ 2025-03-07 17:00     ` Ojaswin Mujoo
  0 siblings, 0 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-07 17:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Baokun Li, linux-kernel,
	Mahesh Kumar

On Fri, Mar 07, 2025 at 03:26:54PM +0100, Jan Kara wrote:
> On Thu 06-03-25 19:58:33, Ojaswin Mujoo wrote:
> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > stress tests, it was observed that in case of some error handling paths, it is
> > possible for update_super_work to start a transaction after the journal is
> > destroyed eg:
> > 
> > (umount)
> > ext4_kill_sb
> >   kill_block_super
> >     generic_shutdown_super
> >       sync_filesystem /* commits all txns */
> >       evict_inodes
> >         /* might start a new txn */
> >       ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >         jbd2_journal_destroy
> >           journal_kill_thread
> >             journal->j_flags |= JBD2_UNMOUNT;
> >           jbd2_journal_commit_transaction
> >             jbd2_journal_get_descriptor_buffer
> >               jbd2_journal_bmap
> >                 ext4_journal_bmap
> >                   ext4_map_blocks
> >                     ...
> >                     ext4_inode_error
> >                       ext4_handle_error
> >                         schedule_work(&sbi->s_sb_upd_work)
> > 
> >                                                /* work queue kicks in */
> >                                                update_super_work
> >                                                  jbd2_journal_start
> >                                                    start_this_handle
> >                                                      BUG_ON(journal->j_flags &
> >                                                             JBD2_UNMOUNT)
> > 
> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > destroying only do a journaled (and deferred) update of sb if this flag is not
> > set. Otherwise, just fallback to an un-journaled commit.
> > 
> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > during ext4_put_super() (except a running transaction that will get commited
> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > outside the journal as it won't race with a journaled update (refer
> > 2d01ddc86606).
> > 
> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > called from mballoc and AFAICT it would be always valid to schedule work here.
> > 
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >  fs/ext4/ext4.h      | 2 ++
> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >  fs/ext4/super.c     | 4 +++-
> >  3 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 2b7d781bfcad..d48e93bd5690 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >  	 */
> >  	struct work_struct s_sb_upd_work;
> >  
> > +	bool s_journal_destorying;
> > +
> 
> Not that it would matter much but why not make this a flag in
> sbi->s_mount_flags?

Noted.

> 
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 9b3c9df02a39..6bd3ca84410d 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >  {
> >  	int err = 0;
> >  
> > +	/*
> > +	 * At this point all pending FS updates should be done except a possible
> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > +	 * is now safe for any new errors to directly commit superblock rather
> > +	 * than going via journal.
> > +	 */
> > +	sbi->s_journal_destorying = true;
> > +
> 
> So as you already uncovered with Zhang Yi, this does not work. What I meant
> was that we move flush_work(&sbi->s_sb_upd_work) into
> ext4_journal_destroy() and set s_journal_destorying *before* calling
> flush_work(). By the time ext4_journal_destroy() gets called, the
> filesystem is quiescent, there cannot be new handles started (except for sb
> update itself from the workqueue) and thus if we hit some error, the
> journal will be aborted anyway and in that case non-journaled sb update is
> safe.

I missed that in my patch, however as in the discussion [1],
even:

ext4_journal_destroy
  sbi->s_journal_destroying = true
  flush_work()

sequence is not enough. Zhang and I were discussing that we might need
to force and wait for commit as well before flushing the work.
Hopefully, with that, we should be covering all the possible edge cases.

[1] https://lore.kernel.org/linux-ext4/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e

Regards,
ojaswin

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07 12:36             ` Zhang Yi
@ 2025-03-07 17:26               ` Ojaswin Mujoo
  2025-03-08  2:57                 ` Zhang Yi
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-07 17:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> >> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> >>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> >>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> >>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> >>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> >>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >>>>>> stress tests, it was observed that in case of some error handling paths, it is
> >>>>>> possible for update_super_work to start a transaction after the journal is
> >>>>>> destroyed eg:
> >>>>>>
> >>>>>> (umount)
> >>>>>> ext4_kill_sb
> >>>>>>   kill_block_super
> >>>>>>     generic_shutdown_super
> >>>>>>       sync_filesystem /* commits all txns */
> >>>>>>       evict_inodes
> >>>>>>         /* might start a new txn */
> >>>>>>       ext4_put_super
> >>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >>>>>>         jbd2_journal_destroy
> >>>>>>           journal_kill_thread
> >>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>           jbd2_journal_commit_transaction
> >>>>>>             jbd2_journal_get_descriptor_buffer
> >>>>>>               jbd2_journal_bmap
> >>>>>>                 ext4_journal_bmap
> >>>>>>                   ext4_map_blocks
> >>>>>>                     ...
> >>>>>>                     ext4_inode_error
> >>>>>>                       ext4_handle_error
> >>>>>>                         schedule_work(&sbi->s_sb_upd_work)
> >>>>>>
> >>>>>>                                                /* work queue kicks in */
> >>>>>>                                                update_super_work
> >>>>>>                                                  jbd2_journal_start
> >>>>>>                                                    start_this_handle
> >>>>>>                                                      BUG_ON(journal->j_flags &
> >>>>>>                                                             JBD2_UNMOUNT)
> >>>>>>
> >>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> >>>>>> set. Otherwise, just fallback to an un-journaled commit.
> >>>>>>
> >>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> >>>>>> during ext4_put_super() (except a running transaction that will get commited
> >>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >>>>>> outside the journal as it won't race with a journaled update (refer
> >>>>>> 2d01ddc86606).
> >>>>>>
> >>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> >>>>>>
> >>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >>>>>> Suggested-by: Jan Kara <jack@suse.cz>
> >>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >>>>>> ---
> >>>>>>  fs/ext4/ext4.h      | 2 ++
> >>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >>>>>>  fs/ext4/super.c     | 4 +++-
> >>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> >>>>>> --- a/fs/ext4/ext4.h
> >>>>>> +++ b/fs/ext4/ext4.h
> >>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >>>>>>  	 */
> >>>>>>  	struct work_struct s_sb_upd_work;
> >>>>>>  
> >>>>>> +	bool s_journal_destorying;
> >>>>>> +
> >>>>>>  	/* Atomic write unit values in bytes */
> >>>>>>  	unsigned int s_awu_min;
> >>>>>>  	unsigned int s_awu_max;
> >>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> >>>>>> --- a/fs/ext4/ext4_jbd2.h
> >>>>>> +++ b/fs/ext4/ext4_jbd2.h
> >>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >>>>>>  {
> >>>>>>  	int err = 0;
> >>>>>>  
> >>>>>> +	/*
> >>>>>> +	 * At this point all pending FS updates should be done except a possible
> >>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >>>>>> +	 * is now safe for any new errors to directly commit superblock rather
> >>>>>> +	 * than going via journal.
> >>>>>> +	 */
> >>>>>> +	sbi->s_journal_destorying = true;
> >>>>>> +
> >>>>>
> >>>>> Hi, Ojaswin!
> >>>>>
> >>>>> I'm afraid you still need to flush the superblock update work here,
> >>>>> otherwise I guess the race condition you mentioned in v1 could still
> >>>>> occur.
> >>>>>
> >>>>>  ext4_put_super()
> >>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>>                     **kjournald2**
> >>>>>                     jbd2_journal_commit_transaction()
> >>>>>                     ...
> >>>>>                     ext4_inode_error()
> >>>>>                       /* JBD2_UNMOUNT not set */
> >>>>>                       schedule_work(s_sb_upd_work)
> >>>>>
> >>>>>                                   **workqueue**
> >>>>>                                    update_super_work
> >>>>>                                    /* s_journal_destorying is not set */
> >>>>>                             	   if (journal && !s_journal_destorying)
> >>>>>
> >>>>>   ext4_journal_destroy()
> >>>>>    /* set s_journal_destorying */
> >>>>>    sbi->s_journal_destorying = true;
> >>>>>    jbd2_journal_destroy()
> >>>>>     journal->j_flags |= JBD2_UNMOUNT;
> >>>>>
> >>>>>                                        jbd2_journal_start()
> >>>>>                                         start_this_handle()
> >>>>>                                           BUG_ON(JBD2_UNMOUNT)
> >>>>>
> >>>>> Thanks,
> >>>>> Yi.
> >>>> Hi Yi,
> >>>>
> >>>> Yes you are right, somehow missed this edge case :(
> >>>>
> >>>> Alright then, we have to move out sbi->s_journal_destroying outside the
> >>>> helper. Just wondering if I should still let it be in
> >>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> >>>> before schedule_work(s_sb_upd_work), because it makes sense.
> >>>>
> >>>> Okay let me give it some thought but thanks for pointing this out!
> >>>>
> >>>> Regards,
> >>>> ojaswin
> >>>
> >>> Okay so thinking about it a bit more, I see you also suggested to flush
> >>> the work after marking sbi->s_journal_destroying. But will that solve
> >>> it?
> >>>
> >>>   ext4_put_super()
> >>>    flush_work(&sbi->s_sb_upd_work)
> >>>  
> >>>                      **kjournald2**
> >>>                      jbd2_journal_commit_transaction()
> >>>                      ...
> >>>                      ext4_inode_error()
> >>>                        /* JBD2_UNMOUNT not set */
> >>>                        schedule_work(s_sb_upd_work)
> >>>  
> >>>                                     **workqueue**
> >>>                                     update_super_work
> >>>                                     /* s_journal_destorying is not set */
> >>>                              	      if (journal && !s_journal_destorying)
> >>>  
> >>>    ext4_journal_destroy()
> >>>     /* set s_journal_destorying */
> >>>     sbi->s_journal_destorying = true;
> >>>     flush_work(&sbi->s_sb_upd_work)
> >>>                                       schedule_work()
> >>                                         ^^^^^^^^^^^^^^^
> >>                                         where does this come from?
> >>
> >> After this flush_work, we can guarantee that the running s_sb_upd_work
> >> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> >> not commit transaction or call schedule_work() again because it has
> >> been aborted due to the previous error. Am I missing something?
> >>
> >> Thanks,
> >> Yi.
> > 
> > Hmm, so I am thinking of a corner case in ext4_handle_error() where 
> > 
> >  if(journal && !is_journal_destroying) 
> > 
> > is computed but schedule_work() not called yet, which is possible cause
> > the cmp followed by jump is not atomic in nature. If the schedule_work
> > is only called after we have done the flush then we end up with this:
> > 
> >                               	      if (journal && !s_journal_destorying)
> >     ext4_journal_destroy()
> >      /* set s_journal_destorying */
> >      sbi->s_journal_destorying = true;
> >      flush_work(&sbi->s_sb_upd_work)
> >                                        schedule_work()
> > 
> > Which is possible IMO, although the window is tiny.
> 
> Yeah, right!
> Sorry for misread the location where you add the "!s_journal_destorying"
> check, the graph I provided was in update_super_work(), which was wrong.

Oh right, I also misread your trace but yes as discussed, even 

    sbi->s_journal_destorying = true;
		flush_work()
    jbd2_journal_destroy()

doesn't work.

> The right one should be:
> 
>  ext4_put_super()
>   flush_work(&sbi->s_sb_upd_work)
> 
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                     ext4_inode_error()
>                       /* s_journal_destorying is not set */
>                       if (journal && !s_journal_destorying)
>                         (schedule_work(s_sb_upd_work))  //can be here
> 
>   ext4_journal_destroy()
>    /* set s_journal_destorying */
>    sbi->s_journal_destorying = true;
>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
> 
>                         (schedule_work(s_sb_upd_work))  //also can be here
> 
>                                   **workqueue**
>                                    update_super_work()
>                                    journal = sbi->s_journal //get journal
>     kfree(journal)
>                                      jbd2_journal_start(journal) //journal UAF
>                                        start_this_handle()
>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
> 
> 
> So there are two problems here, the first one is the 'journal' UAF,
> the second one is triggering JBD2_UNMOUNT flag BUGON.

Indeed, there's a possible UAF here as well.

> 
> >>>
> >>> As for the fix, how about we do something like this:
> >>>
> >>>   ext4_put_super()
> >>>
> >>>    flush_work(&sbi->s_sb_upd_work)
> >>>    destroy_workqueue(sbi->rsv_conversion_wq);
> >>>
> >>>    ext4_journal_destroy()
> >>>     /* set s_journal_destorying */
> >>>     sbi->s_journal_destorying = true;
> >>>
> >>>    /* trigger a commit and wait for it to complete */
> >>>
> >>>     flush_work(&sbi->s_sb_upd_work)
> >>>
> >>>     jbd2_journal_destroy()
> >>>      journal->j_flags |= JBD2_UNMOUNT;
> >>>  
> >>>                                         jbd2_journal_start()
> >>>                                          start_this_handle()
> >>>                                            BUG_ON(JBD2_UNMOUNT)
> >>>
> >>> Still giving this codepath some thought but seems like this might just
> >>> be enough to fix the race. Thoughts on this?
> >>>
> 
> I think this solution should work, the forced commit and flush_work()
> should ensure that the last transaction is committed and that the
> potential work is done.
> 
> Besides, the s_journal_destorying flag is set and check concurrently
> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> about adding a new flag into sbi->s_mount_state instead of adding
> new s_journal_destorying?

Right, that makes sence. I will incorporate these changes in the next 
revision.

Thanks for the review,
ojaswin

> 
> Thanks,
> Yi.
> 
> >>>
> >>>>>
> >>>>>>  	err = jbd2_journal_destroy(journal);
> >>>>>>  	sbi->s_journal = NULL;
> >>>>>>  
> >>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >>>>>> index 8ad664d47806..31552cf0519a 100644
> >>>>>> --- a/fs/ext4/super.c
> >>>>>> +++ b/fs/ext4/super.c
> >>>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >>>>>>  		 * constraints, it may not be safe to do it right here so we
> >>>>>>  		 * defer superblock flushing to a workqueue.
> >>>>>>  		 */
> >>>>>> -		if (continue_fs && journal)
> >>>>>> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >>>>>>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >>>>>>  		else
> >>>>>>  			ext4_commit_super(sb);
> >>>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >>>>>>  	spin_lock_init(&sbi->s_error_lock);
> >>>>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >>>>>>  
> >>>>>> +	sbi->s_journal_destorying = false;
> >>>>>> +
> >>>>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >>>>>>  	if (err)
> >>>>>>  		goto failed_mount3;
> >>>>>
> >>>
> >>
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
  2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
                   ` (2 preceding siblings ...)
  2025-03-06 14:28 ` [PATCH v2 3/3] ext4: Make sb update interval tunable Ojaswin Mujoo
@ 2025-03-08  1:09 ` Baokun Li
  2025-03-08 13:07   ` Ojaswin Mujoo
  3 siblings, 1 reply; 41+ messages in thread
From: Baokun Li @ 2025-03-08  1:09 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> ** Changes since v1 [1] **
>
>   * Picked up RVBs from Jan and Ritesh
>   * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
>     to determine when to journal sb vs when to commit directly.
>   * Added a prep patch 1/3
>
> [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
>
> @Baokun, I didn't get a chance to look into the journal_inode
> modifications we were discussing in [2]. I'll try to spend some time and
> send that as a separate patch. Hope that's okay
>
> [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
>   
That's fine, it's not a priority. And if this patch set makes sure we
don't crash when things go wrong, I'm okay with leaving it as is.

It's possible that jbd2_journal_commit_transaction() could call
ext4_handle_error() in other places as the code evolves. Fixing known
problems and protecting against potential ones is always a good thing.


Cheers,
Baokun
> ** Original Cover **
>
> When running LTP stress tests on ext4, after a multiday run we seemed to
> have hit the following BUG_ON:
>
>   [NIP  : start_this_handle+268]
>   #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
>   #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
>   #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
>   #6 [c000001067c27c70] process_one_work at c000000000188790
>   #7 [c000001067c27d20] worker_thread at c00000000018973c
>   #8 [c000001067c27dc0] kthread at c000000000196c84
>   #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
>
> Which comes out to
>
>    382   repeat:
>    383           read_lock(&journal->j_state_lock);
> * 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
>    385           if (is_journal_aborted(journal) ||
>    386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
>    387                   read_unlock(&journal->j_state_lock);
>
>
> Initially this seemed like it should never happen but upon crash
> analysis it seems like it could indeed be hit as described in patch 1/2.
>
> I would like to add that through the logs we only knew that:
>
> - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
> - update_super_work had hit the BUG_ON
>
> I was not able to hit this bug again (without modifying the kernel to
> inject errors) but the above backtrace seems to be one possible paths
> where this BUG_ON can be hit. Rest of the analysis and fix is in patch
> 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
>
> That being said, journalling is something I'm not very familiar with and
> there might be gaps in my understanding so thoughts and suggestions are
> welcome.
>
> Ojaswin Mujoo (3):
>    ext4: define ext4_journal_destroy wrapper
>    ext4: avoid journaling sb update on error if journal is destroying
>    ext4: Make sb update interval tunable
>
>   fs/ext4/ext4.h      | 11 +++++++++++
>   fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
>   fs/ext4/super.c     | 35 +++++++++++++++++------------------
>   fs/ext4/sysfs.c     |  4 ++++
>   4 files changed, 54 insertions(+), 18 deletions(-)
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper
  2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
  2025-03-07 14:05   ` Jan Kara
@ 2025-03-08  1:18   ` Baokun Li
  1 sibling, 0 replies; 41+ messages in thread
From: Baokun Li @ 2025-03-08  1:18 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> Define an ext4 wrapper over jbd2_journal_destroy to make sure we
> have consistent behavior during journal destruction. This will also
> come useful in the next patch where we add some ext4 specific logic
> in the destroy path.
>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Looks good to me.

Reviewed-by: Baokun Li <libaokun1@huawei.com>
> ---
>   fs/ext4/ext4_jbd2.h | 14 ++++++++++++++
>   fs/ext4/super.c     | 16 ++++++----------
>   2 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 3f2596c9e5f2..9b3c9df02a39 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -429,4 +429,18 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>   	return 1;
>   }
>   
> +/*
> + * Pass journal explicitly as it may not be cached in the sbi->s_journal in some
> + * cases
> + */
> +static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
> +{
> +	int err = 0;
> +
> +	err = jbd2_journal_destroy(journal);
> +	sbi->s_journal = NULL;
> +
> +	return err;
> +}
> +
>   #endif	/* _EXT4_JBD2_H */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a963ffda692a..8ad664d47806 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1297,8 +1297,7 @@ static void ext4_put_super(struct super_block *sb)
>   
>   	if (sbi->s_journal) {
>   		aborted = is_journal_aborted(sbi->s_journal);
> -		err = jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		err = ext4_journal_destroy(sbi, sbi->s_journal);
>   		if ((err < 0) && !aborted) {
>   			ext4_abort(sb, -err, "Couldn't clean up the journal");
>   		}
> @@ -4960,8 +4959,7 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>   out:
>   	/* flush s_sb_upd_work before destroying the journal. */
>   	flush_work(&sbi->s_sb_upd_work);
> -	jbd2_journal_destroy(sbi->s_journal);
> -	sbi->s_journal = NULL;
> +	ext4_journal_destroy(sbi, sbi->s_journal);
>   	return -EINVAL;
>   }
>   
> @@ -5652,8 +5650,7 @@ failed_mount8: __maybe_unused
>   	if (sbi->s_journal) {
>   		/* flush s_sb_upd_work before journal destroy. */
>   		flush_work(&sbi->s_sb_upd_work);
> -		jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		ext4_journal_destroy(sbi, sbi->s_journal);
>   	}
>   failed_mount3a:
>   	ext4_es_unregister_shrinker(sbi);
> @@ -5958,7 +5955,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
>   	return journal;
>   
>   out_journal:
> -	jbd2_journal_destroy(journal);
> +	ext4_journal_destroy(EXT4_SB(sb), journal);
>   out_bdev:
>   	bdev_fput(bdev_file);
>   	return ERR_PTR(errno);
> @@ -6075,8 +6072,7 @@ static int ext4_load_journal(struct super_block *sb,
>   	EXT4_SB(sb)->s_journal = journal;
>   	err = ext4_clear_journal_err(sb, es);
>   	if (err) {
> -		EXT4_SB(sb)->s_journal = NULL;
> -		jbd2_journal_destroy(journal);
> +		ext4_journal_destroy(EXT4_SB(sb), journal);
>   		return err;
>   	}
>   
> @@ -6094,7 +6090,7 @@ static int ext4_load_journal(struct super_block *sb,
>   	return 0;
>   
>   err_out:
> -	jbd2_journal_destroy(journal);
> +	ext4_journal_destroy(EXT4_SB(sb), journal);
>   	return err;
>   }
>   



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/3] ext4: Make sb update interval tunable
  2025-03-06 14:28 ` [PATCH v2 3/3] ext4: Make sb update interval tunable Ojaswin Mujoo
@ 2025-03-08  2:25   ` Baokun Li
  0 siblings, 0 replies; 41+ messages in thread
From: Baokun Li @ 2025-03-08  2:25 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Ritesh Harjani (IBM), Yang Erkun

On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> Currently, outside error paths, we auto commit the super block after 1
> hour has passed and 16MB worth of updates have been written since last
> commit. This is a policy decision so make this tunable while keeping the
> defaults same. This is useful if user wants to tweak the superblock
> behavior or for debugging the codepath by allowing to trigger it more
> frequently.
>
> We can now tweak the super block update using sb_update_sec and
> sb_update_kb files in /sys/fs/ext4/<dev>/
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Looks good.

Reviewed-by: Baokun Li <libaokun1@huawei.com>
> ---
>   fs/ext4/ext4.h  |  9 +++++++++
>   fs/ext4/super.c | 15 ++++++++-------
>   fs/ext4/sysfs.c |  4 ++++
>   3 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index d48e93bd5690..82c902ed86f3 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1608,6 +1608,8 @@ struct ext4_sb_info {
>   	unsigned int s_mb_prefetch;
>   	unsigned int s_mb_prefetch_limit;
>   	unsigned int s_mb_best_avail_max_trim_order;
> +	unsigned int s_sb_update_sec;
> +	unsigned int s_sb_update_kb;
>   
>   	/* stats for buddy allocator */
>   	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
> @@ -2281,6 +2283,13 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
>   #define EXT4_DEF_MIN_BATCH_TIME	0
>   #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
>   
> +/*
> + * Default values for superblock update
> + */
> +#define EXT4_DEF_SB_UPDATE_INTERVAL_SEC (3600) /* seconds (1 hour) */
> +#define EXT4_DEF_SB_UPDATE_INTERVAL_KB (16384) /* kilobytes (16MB) */
> +
> +
>   /*
>    * Minimum number of groups in a flexgroup before we separate out
>    * directories into the first block group of a flexgroup
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 31552cf0519a..1b47b111c583 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -447,9 +447,6 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
>   #define ext4_get_tstamp(es, tstamp) \
>   	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
>   
> -#define EXT4_SB_REFRESH_INTERVAL_SEC (3600) /* seconds (1 hour) */
> -#define EXT4_SB_REFRESH_INTERVAL_KB (16384) /* kilobytes (16MB) */
> -
>   /*
>    * The ext4_maybe_update_superblock() function checks and updates the
>    * superblock if needed.
> @@ -457,8 +454,10 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
>    * This function is designed to update the on-disk superblock only under
>    * certain conditions to prevent excessive disk writes and unnecessary
>    * waking of the disk from sleep. The superblock will be updated if:
> - * 1. More than an hour has passed since the last superblock update, and
> - * 2. More than 16MB have been written since the last superblock update.
> + * 1. More than sbi->s_sb_update_sec (def: 1 hour) has passed since the last
> + *    superblock update
> + * 2. More than sbi->s_sb_update_kb (def: 16MB) kbs have been written since the
> + *    last superblock update.
>    *
>    * @sb: The superblock
>    */
> @@ -479,7 +478,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
>   	now = ktime_get_real_seconds();
>   	last_update = ext4_get_tstamp(es, s_wtime);
>   
> -	if (likely(now - last_update < EXT4_SB_REFRESH_INTERVAL_SEC))
> +	if (likely(now - last_update < sbi->s_sb_update_sec))
>   		return;
>   
>   	lifetime_write_kbytes = sbi->s_kbytes_written +
> @@ -494,7 +493,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
>   	 */
>   	diff_size = lifetime_write_kbytes - le64_to_cpu(es->s_kbytes_written);
>   
> -	if (diff_size > EXT4_SB_REFRESH_INTERVAL_KB)
> +	if (diff_size > sbi->s_sb_update_kb)
>   		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>   }
>   
> @@ -5246,6 +5245,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>   	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
>   	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
>   	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
> +	sbi->s_sb_update_kb = EXT4_DEF_SB_UPDATE_INTERVAL_KB;
> +	sbi->s_sb_update_sec = EXT4_DEF_SB_UPDATE_INTERVAL_SEC;
>   
>   	/*
>   	 * set default s_li_wait_mult for lazyinit, for the case there is
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index ddb54608ca2e..987bd00f916a 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -254,6 +254,8 @@ EXT4_ATTR(journal_task, 0444, journal_task);
>   EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
>   EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
>   EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
> +EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
> +EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
>   
>   static unsigned int old_bump_val = 128;
>   EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> @@ -305,6 +307,8 @@ static struct attribute *ext4_attrs[] = {
>   	ATTR_LIST(mb_prefetch),
>   	ATTR_LIST(mb_prefetch_limit),
>   	ATTR_LIST(last_trim_minblks),
> +	ATTR_LIST(sb_update_sec),
> +	ATTR_LIST(sb_update_kb),
>   	NULL,
>   };
>   ATTRIBUTE_GROUPS(ext4);



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-07 17:26               ` Ojaswin Mujoo
@ 2025-03-08  2:57                 ` Zhang Yi
  2025-03-08  8:18                   ` Ojaswin Mujoo
  2025-03-08 10:01                   ` Ritesh Harjani
  0 siblings, 2 replies; 41+ messages in thread
From: Zhang Yi @ 2025-03-08  2:57 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>> destroyed eg:
>>>>>>>>
>>>>>>>> (umount)
>>>>>>>> ext4_kill_sb
>>>>>>>>   kill_block_super
>>>>>>>>     generic_shutdown_super
>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>       evict_inodes
>>>>>>>>         /* might start a new txn */
>>>>>>>>       ext4_put_super
>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>         jbd2_journal_destroy
>>>>>>>>           journal_kill_thread
>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>               jbd2_journal_bmap
>>>>>>>>                 ext4_journal_bmap
>>>>>>>>                   ext4_map_blocks
>>>>>>>>                     ...
>>>>>>>>                     ext4_inode_error
>>>>>>>>                       ext4_handle_error
>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>                                                update_super_work
>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>                                                    start_this_handle
>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>
>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>> 2d01ddc86606).
>>>>>>>>
>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>
>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>> ---
>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>  	 */
>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>  
>>>>>>>> +	bool s_journal_destorying;
>>>>>>>> +
>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>  {
>>>>>>>>  	int err = 0;
>>>>>>>>  
>>>>>>>> +	/*
>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>> +	 * than going via journal.
>>>>>>>> +	 */
>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>> +
>>>>>>>
>>>>>>> Hi, Ojaswin!
>>>>>>>
>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>> occur.
>>>>>>>
>>>>>>>  ext4_put_super()
>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>
>>>>>>>                     **kjournald2**
>>>>>>>                     jbd2_journal_commit_transaction()
>>>>>>>                     ...
>>>>>>>                     ext4_inode_error()
>>>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>>>                       schedule_work(s_sb_upd_work)
>>>>>>>
>>>>>>>                                   **workqueue**
>>>>>>>                                    update_super_work
>>>>>>>                                    /* s_journal_destorying is not set */
>>>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>>>
>>>>>>>   ext4_journal_destroy()
>>>>>>>    /* set s_journal_destorying */
>>>>>>>    sbi->s_journal_destorying = true;
>>>>>>>    jbd2_journal_destroy()
>>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>
>>>>>>>                                        jbd2_journal_start()
>>>>>>>                                         start_this_handle()
>>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yi.
>>>>>> Hi Yi,
>>>>>>
>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>
>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>> helper. Just wondering if I should still let it be in
>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>
>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>
>>>>>> Regards,
>>>>>> ojaswin
>>>>>
>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>> it?
>>>>>
>>>>>   ext4_put_super()
>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>  
>>>>>                      **kjournald2**
>>>>>                      jbd2_journal_commit_transaction()
>>>>>                      ...
>>>>>                      ext4_inode_error()
>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>                        schedule_work(s_sb_upd_work)
>>>>>  
>>>>>                                     **workqueue**
>>>>>                                     update_super_work
>>>>>                                     /* s_journal_destorying is not set */
>>>>>                              	      if (journal && !s_journal_destorying)
>>>>>  
>>>>>    ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>                                       schedule_work()
>>>>                                         ^^^^^^^^^^^^^^^
>>>>                                         where does this come from?
>>>>
>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>> not commit transaction or call schedule_work() again because it has
>>>> been aborted due to the previous error. Am I missing something?
>>>>
>>>> Thanks,
>>>> Yi.
>>>
>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
>>>
>>>  if(journal && !is_journal_destroying) 
>>>
>>> is computed but schedule_work() not called yet, which is possible cause
>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>> is only called after we have done the flush then we end up with this:
>>>
>>>                               	      if (journal && !s_journal_destorying)
>>>     ext4_journal_destroy()
>>>      /* set s_journal_destorying */
>>>      sbi->s_journal_destorying = true;
>>>      flush_work(&sbi->s_sb_upd_work)
>>>                                        schedule_work()
>>>
>>> Which is possible IMO, although the window is tiny.
>>
>> Yeah, right!
>> Sorry for misread the location where you add the "!s_journal_destorying"
>> check, the graph I provided was in update_super_work(), which was wrong.
> 
> Oh right, I also misread your trace but yes as discussed, even 
> 
>     sbi->s_journal_destorying = true;
> 		flush_work()
>     jbd2_journal_destroy()
> 
> doesn't work.
> 
>> The right one should be:
>>
>>  ext4_put_super()
>>   flush_work(&sbi->s_sb_upd_work)
>>
>>                     **kjournald2**
>>                     jbd2_journal_commit_transaction()
>>                     ...
>>                     ext4_inode_error()
>>                       /* s_journal_destorying is not set */
>>                       if (journal && !s_journal_destorying)
>>                         (schedule_work(s_sb_upd_work))  //can be here
>>
>>   ext4_journal_destroy()
>>    /* set s_journal_destorying */
>>    sbi->s_journal_destorying = true;
>>    jbd2_journal_destroy()
>>     journal->j_flags |= JBD2_UNMOUNT;
>>
>>                         (schedule_work(s_sb_upd_work))  //also can be here
>>
>>                                   **workqueue**
>>                                    update_super_work()
>>                                    journal = sbi->s_journal //get journal
>>     kfree(journal)
>>                                      jbd2_journal_start(journal) //journal UAF
>>                                        start_this_handle()
>>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
>>
>>
>> So there are two problems here, the first one is the 'journal' UAF,
>> the second one is triggering JBD2_UNMOUNT flag BUGON.
> 
> Indeed, there's a possible UAF here as well.
> 
>>
>>>>>
>>>>> As for the fix, how about we do something like this:
>>>>>
>>>>>   ext4_put_super()
>>>>>
>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>
>>>>>    ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>
>>>>>    /* trigger a commit and wait for it to complete */
>>>>>
>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>>     jbd2_journal_destroy()
>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>  
>>>>>                                         jbd2_journal_start()
>>>>>                                          start_this_handle()
>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>
>>>>> Still giving this codepath some thought but seems like this might just
>>>>> be enough to fix the race. Thoughts on this?
>>>>>
>>
>> I think this solution should work, the forced commit and flush_work()
>> should ensure that the last transaction is committed and that the
>> potential work is done.
>>
>> Besides, the s_journal_destorying flag is set and check concurrently
>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>> about adding a new flag into sbi->s_mount_state instead of adding
>> new s_journal_destorying?
> 
> Right, that makes sence. I will incorporate these changes in the next 
> revision.
> 

Think about this again, it seems that we no longer need the destroying
flag. Because we force to commit and wait for the **last** transaction to
complete, and the flush work should also ensure that the last sb_update
work to complete. Regardless of whether it starts a new handle in the
last update_super_work(), it will not commit since the journal should
have aborted. What are your thoughts?

 ext4_put_super()
  flush_work(&sbi->s_sb_upd_work)
  destroy_workqueue(sbi->rsv_conversion_wq)

  ext4_journal_destroy()
   /* trigger a commit (it will commit the last trnasaction) */

                    **kjournald2**
                    jbd2_journal_commit_transaction()
                    ...
                     ext4_inode_error()
                      schedule_work(s_sb_upd_work))

                                     **workqueue**
                                      update_super_work()
                                        jbd2_journal_start(journal)
                                          start_this_handle()
                                          //This new trans will
                                          //not be committed.

                     jbd2_journal_abort()

   /* wait for it to complete */

   flush_work(&sbi->s_sb_upd_work)
   jbd2_journal_destroy()
    journal->j_flags |= JBD2_UNMOUNT;
   jbd2_journal_commit_transaction() //it will commit nothing

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08  2:57                 ` Zhang Yi
@ 2025-03-08  8:18                   ` Ojaswin Mujoo
  2025-03-08  9:58                     ` Ojaswin Mujoo
  2025-03-08 10:01                   ` Ritesh Harjani
  1 sibling, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08  8:18 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
> >>>>>>>> possible for update_super_work to start a transaction after the journal is
> >>>>>>>> destroyed eg:
> >>>>>>>>
> >>>>>>>> (umount)
> >>>>>>>> ext4_kill_sb
> >>>>>>>>   kill_block_super
> >>>>>>>>     generic_shutdown_super
> >>>>>>>>       sync_filesystem /* commits all txns */
> >>>>>>>>       evict_inodes
> >>>>>>>>         /* might start a new txn */
> >>>>>>>>       ext4_put_super
> >>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >>>>>>>>         jbd2_journal_destroy
> >>>>>>>>           journal_kill_thread
> >>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>>           jbd2_journal_commit_transaction
> >>>>>>>>             jbd2_journal_get_descriptor_buffer
> >>>>>>>>               jbd2_journal_bmap
> >>>>>>>>                 ext4_journal_bmap
> >>>>>>>>                   ext4_map_blocks
> >>>>>>>>                     ...
> >>>>>>>>                     ext4_inode_error
> >>>>>>>>                       ext4_handle_error
> >>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
> >>>>>>>>
> >>>>>>>>                                                /* work queue kicks in */
> >>>>>>>>                                                update_super_work
> >>>>>>>>                                                  jbd2_journal_start
> >>>>>>>>                                                    start_this_handle
> >>>>>>>>                                                      BUG_ON(journal->j_flags &
> >>>>>>>>                                                             JBD2_UNMOUNT)
> >>>>>>>>
> >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> >>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
> >>>>>>>>
> >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> >>>>>>>> during ext4_put_super() (except a running transaction that will get commited
> >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >>>>>>>> outside the journal as it won't race with a journaled update (refer
> >>>>>>>> 2d01ddc86606).
> >>>>>>>>
> >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> >>>>>>>>
> >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
> >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >>>>>>>> ---
> >>>>>>>>  fs/ext4/ext4.h      | 2 ++
> >>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >>>>>>>>  fs/ext4/super.c     | 4 +++-
> >>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> >>>>>>>> --- a/fs/ext4/ext4.h
> >>>>>>>> +++ b/fs/ext4/ext4.h
> >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >>>>>>>>  	 */
> >>>>>>>>  	struct work_struct s_sb_upd_work;
> >>>>>>>>  
> >>>>>>>> +	bool s_journal_destorying;
> >>>>>>>> +
> >>>>>>>>  	/* Atomic write unit values in bytes */
> >>>>>>>>  	unsigned int s_awu_min;
> >>>>>>>>  	unsigned int s_awu_max;
> >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >>>>>>>>  {
> >>>>>>>>  	int err = 0;
> >>>>>>>>  
> >>>>>>>> +	/*
> >>>>>>>> +	 * At this point all pending FS updates should be done except a possible
> >>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
> >>>>>>>> +	 * than going via journal.
> >>>>>>>> +	 */
> >>>>>>>> +	sbi->s_journal_destorying = true;
> >>>>>>>> +
> >>>>>>>
> >>>>>>> Hi, Ojaswin!
> >>>>>>>
> >>>>>>> I'm afraid you still need to flush the superblock update work here,
> >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> >>>>>>> occur.
> >>>>>>>
> >>>>>>>  ext4_put_super()
> >>>>>>>   flush_work(&sbi->s_sb_upd_work)
> >>>>>>>
> >>>>>>>                     **kjournald2**
> >>>>>>>                     jbd2_journal_commit_transaction()
> >>>>>>>                     ...
> >>>>>>>                     ext4_inode_error()
> >>>>>>>                       /* JBD2_UNMOUNT not set */
> >>>>>>>                       schedule_work(s_sb_upd_work)
> >>>>>>>
> >>>>>>>                                   **workqueue**
> >>>>>>>                                    update_super_work
> >>>>>>>                                    /* s_journal_destorying is not set */
> >>>>>>>                             	   if (journal && !s_journal_destorying)
> >>>>>>>
> >>>>>>>   ext4_journal_destroy()
> >>>>>>>    /* set s_journal_destorying */
> >>>>>>>    sbi->s_journal_destorying = true;
> >>>>>>>    jbd2_journal_destroy()
> >>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>
> >>>>>>>                                        jbd2_journal_start()
> >>>>>>>                                         start_this_handle()
> >>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Yi.
> >>>>>> Hi Yi,
> >>>>>>
> >>>>>> Yes you are right, somehow missed this edge case :(
> >>>>>>
> >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
> >>>>>> helper. Just wondering if I should still let it be in
> >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> >>>>>>
> >>>>>> Okay let me give it some thought but thanks for pointing this out!
> >>>>>>
> >>>>>> Regards,
> >>>>>> ojaswin
> >>>>>
> >>>>> Okay so thinking about it a bit more, I see you also suggested to flush
> >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> >>>>> it?
> >>>>>
> >>>>>   ext4_put_super()
> >>>>>    flush_work(&sbi->s_sb_upd_work)
> >>>>>  
> >>>>>                      **kjournald2**
> >>>>>                      jbd2_journal_commit_transaction()
> >>>>>                      ...
> >>>>>                      ext4_inode_error()
> >>>>>                        /* JBD2_UNMOUNT not set */
> >>>>>                        schedule_work(s_sb_upd_work)
> >>>>>  
> >>>>>                                     **workqueue**
> >>>>>                                     update_super_work
> >>>>>                                     /* s_journal_destorying is not set */
> >>>>>                              	      if (journal && !s_journal_destorying)
> >>>>>  
> >>>>>    ext4_journal_destroy()
> >>>>>     /* set s_journal_destorying */
> >>>>>     sbi->s_journal_destorying = true;
> >>>>>     flush_work(&sbi->s_sb_upd_work)
> >>>>>                                       schedule_work()
> >>>>                                         ^^^^^^^^^^^^^^^
> >>>>                                         where does this come from?
> >>>>
> >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> >>>> not commit transaction or call schedule_work() again because it has
> >>>> been aborted due to the previous error. Am I missing something?
> >>>>
> >>>> Thanks,
> >>>> Yi.
> >>>
> >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
> >>>
> >>>  if(journal && !is_journal_destroying) 
> >>>
> >>> is computed but schedule_work() not called yet, which is possible cause
> >>> the cmp followed by jump is not atomic in nature. If the schedule_work
> >>> is only called after we have done the flush then we end up with this:
> >>>
> >>>                               	      if (journal && !s_journal_destorying)
> >>>     ext4_journal_destroy()
> >>>      /* set s_journal_destorying */
> >>>      sbi->s_journal_destorying = true;
> >>>      flush_work(&sbi->s_sb_upd_work)
> >>>                                        schedule_work()
> >>>
> >>> Which is possible IMO, although the window is tiny.
> >>
> >> Yeah, right!
> >> Sorry for misread the location where you add the "!s_journal_destorying"
> >> check, the graph I provided was in update_super_work(), which was wrong.
> > 
> > Oh right, I also misread your trace but yes as discussed, even 
> > 
> >     sbi->s_journal_destorying = true;
> > 		flush_work()
> >     jbd2_journal_destroy()
> > 
> > doesn't work.
> > 
> >> The right one should be:
> >>
> >>  ext4_put_super()
> >>   flush_work(&sbi->s_sb_upd_work)
> >>
> >>                     **kjournald2**
> >>                     jbd2_journal_commit_transaction()
> >>                     ...
> >>                     ext4_inode_error()
> >>                       /* s_journal_destorying is not set */
> >>                       if (journal && !s_journal_destorying)
> >>                         (schedule_work(s_sb_upd_work))  //can be here
> >>
> >>   ext4_journal_destroy()
> >>    /* set s_journal_destorying */
> >>    sbi->s_journal_destorying = true;
> >>    jbd2_journal_destroy()
> >>     journal->j_flags |= JBD2_UNMOUNT;
> >>
> >>                         (schedule_work(s_sb_upd_work))  //also can be here
> >>
> >>                                   **workqueue**
> >>                                    update_super_work()
> >>                                    journal = sbi->s_journal //get journal
> >>     kfree(journal)
> >>                                      jbd2_journal_start(journal) //journal UAF
> >>                                        start_this_handle()
> >>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
> >>
> >>
> >> So there are two problems here, the first one is the 'journal' UAF,
> >> the second one is triggering JBD2_UNMOUNT flag BUGON.
> > 
> > Indeed, there's a possible UAF here as well.
> > 
> >>
> >>>>>
> >>>>> As for the fix, how about we do something like this:
> >>>>>
> >>>>>   ext4_put_super()
> >>>>>
> >>>>>    flush_work(&sbi->s_sb_upd_work)
> >>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
> >>>>>
> >>>>>    ext4_journal_destroy()
> >>>>>     /* set s_journal_destorying */
> >>>>>     sbi->s_journal_destorying = true;
> >>>>>
> >>>>>    /* trigger a commit and wait for it to complete */
> >>>>>
> >>>>>     flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>>     jbd2_journal_destroy()
> >>>>>      journal->j_flags |= JBD2_UNMOUNT;
> >>>>>  
> >>>>>                                         jbd2_journal_start()
> >>>>>                                          start_this_handle()
> >>>>>                                            BUG_ON(JBD2_UNMOUNT)
> >>>>>
> >>>>> Still giving this codepath some thought but seems like this might just
> >>>>> be enough to fix the race. Thoughts on this?
> >>>>>
> >>
> >> I think this solution should work, the forced commit and flush_work()
> >> should ensure that the last transaction is committed and that the
> >> potential work is done.
> >>
> >> Besides, the s_journal_destorying flag is set and check concurrently
> >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> >> about adding a new flag into sbi->s_mount_state instead of adding
> >> new s_journal_destorying?
> > 
> > Right, that makes sence. I will incorporate these changes in the next 
> > revision.
> > 
> 
> Think about this again, it seems that we no longer need the destroying
> flag. Because we force to commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work to complete. Regardless of whether it starts a new handle in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
> 
>  ext4_put_super()
>   flush_work(&sbi->s_sb_upd_work)
>   destroy_workqueue(sbi->rsv_conversion_wq)
> 
>   ext4_journal_destroy()
>    /* trigger a commit (it will commit the last trnasaction) */
> 
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                      ext4_inode_error()
>                       schedule_work(s_sb_upd_work))
> 
>                                      **workqueue**
>                                       update_super_work()
>                                         jbd2_journal_start(journal)
>                                           start_this_handle()
>                                           //This new trans will
>                                           //not be committed.
> 
>                      jbd2_journal_abort()
> 
>    /* wait for it to complete */
> 
>    flush_work(&sbi->s_sb_upd_work)
>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
>    jbd2_journal_commit_transaction() //it will commit nothing
> 
> Thanks,
> Yi.

Hi Yi,

There's one more path for which we need the flag:

   ext4_journal_destroy()
    /* trigger a commit (it will commit the last trnasaction) */
 
                     **kjournald2**
                     jbd2_journal_commit_transaction()
										 	journal->j_commit_callback()
											  ext4_journal_commit_callback()
												  ext4_maybe_update_superblock()
													  schedule_work()
			/* start a transaction here */
			flush_work()
			  jbd2_journal_destroy()
				  journal_kill_thread
					  flags |= JBD2_UNMOUNT
				  jbd2_journal_commit_transaction()
					  ...
            ext4_inode_error()
							schedule_work(s_sb_upd_work))
							/* update_super_work_tries to start the txn */
							BUG_ON()

I think this to protect against this path we do need a flag. 

Regards,
ojaswin
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
  2025-03-07  2:49   ` Zhang Yi
  2025-03-07 14:26   ` Jan Kara
@ 2025-03-08  9:55   ` Ritesh Harjani (IBM)
  2025-03-08 13:05     ` Ojaswin Mujoo
  2025-03-12 13:57   ` Eric Sandeen
  3 siblings, 1 reply; 41+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-03-08  9:55 UTC (permalink / raw)
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar

Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> Presently we always BUG_ON if trying to start a transaction on a journal marked
> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> stress tests, it was observed that in case of some error handling paths, it is
> possible for update_super_work to start a transaction after the journal is
> destroyed eg:
>
> (umount)
> ext4_kill_sb
>   kill_block_super
>     generic_shutdown_super
>       sync_filesystem /* commits all txns */
>       evict_inodes
>         /* might start a new txn */
>       ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>         jbd2_journal_destroy
>           journal_kill_thread
>             journal->j_flags |= JBD2_UNMOUNT;
>           jbd2_journal_commit_transaction
>             jbd2_journal_get_descriptor_buffer
>               jbd2_journal_bmap
>                 ext4_journal_bmap
>                   ext4_map_blocks
>                     ...
>                     ext4_inode_error
>                       ext4_handle_error
>                         schedule_work(&sbi->s_sb_upd_work)
>
>                                                /* work queue kicks in */
>                                                update_super_work
>                                                  jbd2_journal_start
>                                                    start_this_handle
>                                                      BUG_ON(journal->j_flags &
>                                                             JBD2_UNMOUNT)
>
> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> destroying only do a journaled (and deferred) update of sb if this flag is not
> set. Otherwise, just fallback to an un-journaled commit.
>
> We set sbi->s_journal_destroying = true only after all the FS updates are done
> during ext4_put_super() (except a running transaction that will get commited
> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> outside the journal as it won't race with a journaled update (refer
> 2d01ddc86606).
>
> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> called from mballoc and AFAICT it would be always valid to schedule work here.
>
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/ext4/ext4.h      | 2 ++
>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>  fs/ext4/super.c     | 4 +++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b7d781bfcad..d48e93bd5690 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>  	 */
>  	struct work_struct s_sb_upd_work;
>  
> +	bool s_journal_destorying;
> +
>  	/* Atomic write unit values in bytes */
>  	unsigned int s_awu_min;
>  	unsigned int s_awu_max;
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 9b3c9df02a39..6bd3ca84410d 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>  {
>  	int err = 0;
>  
> +	/*
> +	 * At this point all pending FS updates should be done except a possible
> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> +	 * is now safe for any new errors to directly commit superblock rather
> +	 * than going via journal.
> +	 */
> +	sbi->s_journal_destorying = true;

This is not correct right. I think what we decided to set this flag
before we flush the workqueue. So that we don't schedule any new
work after this flag has been set. At least that is what I understood.

[1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/

-ritesh


> +
>  	err = jbd2_journal_destroy(journal);
>  	sbi->s_journal = NULL;
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8ad664d47806..31552cf0519a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>  		 * constraints, it may not be safe to do it right here so we
>  		 * defer superblock flushing to a workqueue.
>  		 */
> -		if (continue_fs && journal)
> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>  		else
>  			ext4_commit_super(sb);
> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	spin_lock_init(&sbi->s_error_lock);
>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>  
> +	sbi->s_journal_destorying = false;
> +
>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>  	if (err)
>  		goto failed_mount3;
> -- 
> 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08  8:18                   ` Ojaswin Mujoo
@ 2025-03-08  9:58                     ` Ojaswin Mujoo
  2025-03-08 10:10                       ` Baokun Li
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08  9:58 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> > >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
> > >>>>>>>> possible for update_super_work to start a transaction after the journal is
> > >>>>>>>> destroyed eg:
> > >>>>>>>>
> > >>>>>>>> (umount)
> > >>>>>>>> ext4_kill_sb
> > >>>>>>>>   kill_block_super
> > >>>>>>>>     generic_shutdown_super
> > >>>>>>>>       sync_filesystem /* commits all txns */
> > >>>>>>>>       evict_inodes
> > >>>>>>>>         /* might start a new txn */
> > >>>>>>>>       ext4_put_super
> > >>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >>>>>>>>         jbd2_journal_destroy
> > >>>>>>>>           journal_kill_thread
> > >>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>>           jbd2_journal_commit_transaction
> > >>>>>>>>             jbd2_journal_get_descriptor_buffer
> > >>>>>>>>               jbd2_journal_bmap
> > >>>>>>>>                 ext4_journal_bmap
> > >>>>>>>>                   ext4_map_blocks
> > >>>>>>>>                     ...
> > >>>>>>>>                     ext4_inode_error
> > >>>>>>>>                       ext4_handle_error
> > >>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
> > >>>>>>>>
> > >>>>>>>>                                                /* work queue kicks in */
> > >>>>>>>>                                                update_super_work
> > >>>>>>>>                                                  jbd2_journal_start
> > >>>>>>>>                                                    start_this_handle
> > >>>>>>>>                                                      BUG_ON(journal->j_flags &
> > >>>>>>>>                                                             JBD2_UNMOUNT)
> > >>>>>>>>
> > >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> > >>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
> > >>>>>>>>
> > >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> > >>>>>>>> during ext4_put_super() (except a running transaction that will get commited
> > >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > >>>>>>>> outside the journal as it won't race with a journaled update (refer
> > >>>>>>>> 2d01ddc86606).
> > >>>>>>>>
> > >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> > >>>>>>>>
> > >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > >>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
> > >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > >>>>>>>> ---
> > >>>>>>>>  fs/ext4/ext4.h      | 2 ++
> > >>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > >>>>>>>>  fs/ext4/super.c     | 4 +++-
> > >>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> > >>>>>>>> --- a/fs/ext4/ext4.h
> > >>>>>>>> +++ b/fs/ext4/ext4.h
> > >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > >>>>>>>>  	 */
> > >>>>>>>>  	struct work_struct s_sb_upd_work;
> > >>>>>>>>  
> > >>>>>>>> +	bool s_journal_destorying;
> > >>>>>>>> +
> > >>>>>>>>  	/* Atomic write unit values in bytes */
> > >>>>>>>>  	unsigned int s_awu_min;
> > >>>>>>>>  	unsigned int s_awu_max;
> > >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> > >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> > >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > >>>>>>>>  {
> > >>>>>>>>  	int err = 0;
> > >>>>>>>>  
> > >>>>>>>> +	/*
> > >>>>>>>> +	 * At this point all pending FS updates should be done except a possible
> > >>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > >>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
> > >>>>>>>> +	 * than going via journal.
> > >>>>>>>> +	 */
> > >>>>>>>> +	sbi->s_journal_destorying = true;
> > >>>>>>>> +
> > >>>>>>>
> > >>>>>>> Hi, Ojaswin!
> > >>>>>>>
> > >>>>>>> I'm afraid you still need to flush the superblock update work here,
> > >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> > >>>>>>> occur.
> > >>>>>>>
> > >>>>>>>  ext4_put_super()
> > >>>>>>>   flush_work(&sbi->s_sb_upd_work)
> > >>>>>>>
> > >>>>>>>                     **kjournald2**
> > >>>>>>>                     jbd2_journal_commit_transaction()
> > >>>>>>>                     ...
> > >>>>>>>                     ext4_inode_error()
> > >>>>>>>                       /* JBD2_UNMOUNT not set */
> > >>>>>>>                       schedule_work(s_sb_upd_work)
> > >>>>>>>
> > >>>>>>>                                   **workqueue**
> > >>>>>>>                                    update_super_work
> > >>>>>>>                                    /* s_journal_destorying is not set */
> > >>>>>>>                             	   if (journal && !s_journal_destorying)
> > >>>>>>>
> > >>>>>>>   ext4_journal_destroy()
> > >>>>>>>    /* set s_journal_destorying */
> > >>>>>>>    sbi->s_journal_destorying = true;
> > >>>>>>>    jbd2_journal_destroy()
> > >>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>
> > >>>>>>>                                        jbd2_journal_start()
> > >>>>>>>                                         start_this_handle()
> > >>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Yi.
> > >>>>>> Hi Yi,
> > >>>>>>
> > >>>>>> Yes you are right, somehow missed this edge case :(
> > >>>>>>
> > >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
> > >>>>>> helper. Just wondering if I should still let it be in
> > >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> > >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> > >>>>>>
> > >>>>>> Okay let me give it some thought but thanks for pointing this out!
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> ojaswin
> > >>>>>
> > >>>>> Okay so thinking about it a bit more, I see you also suggested to flush
> > >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> > >>>>> it?
> > >>>>>
> > >>>>>   ext4_put_super()
> > >>>>>    flush_work(&sbi->s_sb_upd_work)
> > >>>>>  
> > >>>>>                      **kjournald2**
> > >>>>>                      jbd2_journal_commit_transaction()
> > >>>>>                      ...
> > >>>>>                      ext4_inode_error()
> > >>>>>                        /* JBD2_UNMOUNT not set */
> > >>>>>                        schedule_work(s_sb_upd_work)
> > >>>>>  
> > >>>>>                                     **workqueue**
> > >>>>>                                     update_super_work
> > >>>>>                                     /* s_journal_destorying is not set */
> > >>>>>                              	      if (journal && !s_journal_destorying)
> > >>>>>  
> > >>>>>    ext4_journal_destroy()
> > >>>>>     /* set s_journal_destorying */
> > >>>>>     sbi->s_journal_destorying = true;
> > >>>>>     flush_work(&sbi->s_sb_upd_work)
> > >>>>>                                       schedule_work()
> > >>>>                                         ^^^^^^^^^^^^^^^
> > >>>>                                         where does this come from?
> > >>>>
> > >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> > >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > >>>> not commit transaction or call schedule_work() again because it has
> > >>>> been aborted due to the previous error. Am I missing something?
> > >>>>
> > >>>> Thanks,
> > >>>> Yi.
> > >>>
> > >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
> > >>>
> > >>>  if(journal && !is_journal_destroying) 
> > >>>
> > >>> is computed but schedule_work() not called yet, which is possible cause
> > >>> the cmp followed by jump is not atomic in nature. If the schedule_work
> > >>> is only called after we have done the flush then we end up with this:
> > >>>
> > >>>                               	      if (journal && !s_journal_destorying)
> > >>>     ext4_journal_destroy()
> > >>>      /* set s_journal_destorying */
> > >>>      sbi->s_journal_destorying = true;
> > >>>      flush_work(&sbi->s_sb_upd_work)
> > >>>                                        schedule_work()
> > >>>
> > >>> Which is possible IMO, although the window is tiny.
> > >>
> > >> Yeah, right!
> > >> Sorry for misread the location where you add the "!s_journal_destorying"
> > >> check, the graph I provided was in update_super_work(), which was wrong.
> > > 
> > > Oh right, I also misread your trace but yes as discussed, even 
> > > 
> > >     sbi->s_journal_destorying = true;
> > > 		flush_work()
> > >     jbd2_journal_destroy()
> > > 
> > > doesn't work.
> > > 
> > >> The right one should be:
> > >>
> > >>  ext4_put_super()
> > >>   flush_work(&sbi->s_sb_upd_work)
> > >>
> > >>                     **kjournald2**
> > >>                     jbd2_journal_commit_transaction()
> > >>                     ...
> > >>                     ext4_inode_error()
> > >>                       /* s_journal_destorying is not set */
> > >>                       if (journal && !s_journal_destorying)
> > >>                         (schedule_work(s_sb_upd_work))  //can be here
> > >>
> > >>   ext4_journal_destroy()
> > >>    /* set s_journal_destorying */
> > >>    sbi->s_journal_destorying = true;
> > >>    jbd2_journal_destroy()
> > >>     journal->j_flags |= JBD2_UNMOUNT;
> > >>
> > >>                         (schedule_work(s_sb_upd_work))  //also can be here
> > >>
> > >>                                   **workqueue**
> > >>                                    update_super_work()
> > >>                                    journal = sbi->s_journal //get journal
> > >>     kfree(journal)
> > >>                                      jbd2_journal_start(journal) //journal UAF
> > >>                                        start_this_handle()
> > >>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
> > >>
> > >>
> > >> So there are two problems here, the first one is the 'journal' UAF,
> > >> the second one is triggering JBD2_UNMOUNT flag BUGON.
> > > 
> > > Indeed, there's a possible UAF here as well.
> > > 
> > >>
> > >>>>>
> > >>>>> As for the fix, how about we do something like this:
> > >>>>>
> > >>>>>   ext4_put_super()
> > >>>>>
> > >>>>>    flush_work(&sbi->s_sb_upd_work)
> > >>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
> > >>>>>
> > >>>>>    ext4_journal_destroy()
> > >>>>>     /* set s_journal_destorying */
> > >>>>>     sbi->s_journal_destorying = true;
> > >>>>>
> > >>>>>    /* trigger a commit and wait for it to complete */
> > >>>>>
> > >>>>>     flush_work(&sbi->s_sb_upd_work)
> > >>>>>
> > >>>>>     jbd2_journal_destroy()
> > >>>>>      journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>  
> > >>>>>                                         jbd2_journal_start()
> > >>>>>                                          start_this_handle()
> > >>>>>                                            BUG_ON(JBD2_UNMOUNT)
> > >>>>>
> > >>>>> Still giving this codepath some thought but seems like this might just
> > >>>>> be enough to fix the race. Thoughts on this?
> > >>>>>
> > >>
> > >> I think this solution should work, the forced commit and flush_work()
> > >> should ensure that the last transaction is committed and that the
> > >> potential work is done.
> > >>
> > >> Besides, the s_journal_destorying flag is set and check concurrently
> > >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> > >> about adding a new flag into sbi->s_mount_state instead of adding
> > >> new s_journal_destorying?
> > > 
> > > Right, that makes sence. I will incorporate these changes in the next 
> > > revision.
> > > 
> > 
> > Think about this again, it seems that we no longer need the destroying
> > flag. Because we force to commit and wait for the **last** transaction to
> > complete, and the flush work should also ensure that the last sb_update
> > work to complete. Regardless of whether it starts a new handle in the
> > last update_super_work(), it will not commit since the journal should
> > have aborted. What are your thoughts?
> > 
> >  ext4_put_super()
> >   flush_work(&sbi->s_sb_upd_work)
> >   destroy_workqueue(sbi->rsv_conversion_wq)
> > 
> >   ext4_journal_destroy()
> >    /* trigger a commit (it will commit the last trnasaction) */
> > 
> >                     **kjournald2**
> >                     jbd2_journal_commit_transaction()
> >                     ...
> >                      ext4_inode_error()
> >                       schedule_work(s_sb_upd_work))
> > 
> >                                      **workqueue**
> >                                       update_super_work()
> >                                         jbd2_journal_start(journal)
> >                                           start_this_handle()
> >                                           //This new trans will
> >                                           //not be committed.
> > 
> >                      jbd2_journal_abort()
> > 
> >    /* wait for it to complete */
> > 
> >    flush_work(&sbi->s_sb_upd_work)
> >    jbd2_journal_destroy()
> >     journal->j_flags |= JBD2_UNMOUNT;
> >    jbd2_journal_commit_transaction() //it will commit nothing
> > 
> > Thanks,
> > Yi.
> 
> Hi Yi,
> 
> There's one more path for which we need the flag:
> 
>    ext4_journal_destroy()
>     /* trigger a commit (it will commit the last trnasaction) */
>  
>                      **kjournald2**
>                      jbd2_journal_commit_transaction()
> 										 	journal->j_commit_callback()
> 											  ext4_journal_commit_callback()
> 												  ext4_maybe_update_superblock()
> 													  schedule_work()
> 			/* start a transaction here */
> 			flush_work()
> 			  jbd2_journal_destroy()
> 				  journal_kill_thread
> 					  flags |= JBD2_UNMOUNT
> 				  jbd2_journal_commit_transaction()
> 					  ...
>             ext4_inode_error()
> 							schedule_work(s_sb_upd_work))
> 							/* update_super_work_tries to start the txn */
> 							BUG_ON()

Oops the formatting is wrong, here's the trace:

 ext4_journal_destroy()
    /* trigger a commit (it will commit the last trnasaction) */

                   **kjournald2**
                   jbd2_journal_commit_transaction()
                    journal->j_commit_callback()
                      ext4_journal_commit_callback()
                        ext4_maybe_update_superblock()
                          schedule_work()

    /* update_super_work starts a new txn here */
    flush_work()
    jbd2_journal_destroy()
      journal_kill_thread
        flags |= JBD2_UNMOUNT
      jbd2_journal_commit_transaction()
        ...
        ext4_inode_error()
          schedule_work(s_sb_upd_work))
          /* update_super_work_tries to start the txn */
          BUG_ON()

> 
> I think this to protect against this path we do need a flag. 
> 
> Regards,
> ojaswin
> > 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08  2:57                 ` Zhang Yi
  2025-03-08  8:18                   ` Ojaswin Mujoo
@ 2025-03-08 10:01                   ` Ritesh Harjani
  1 sibling, 0 replies; 41+ messages in thread
From: Ritesh Harjani @ 2025-03-08 10:01 UTC (permalink / raw)
  To: Zhang Yi, Ojaswin Mujoo
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

Zhang Yi <yi.zhang@huaweicloud.com> writes:

> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>>> destroyed eg:
>>>>>>>>>
>>>>>>>>> (umount)
>>>>>>>>> ext4_kill_sb
>>>>>>>>>   kill_block_super
>>>>>>>>>     generic_shutdown_super
>>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>>       evict_inodes
>>>>>>>>>         /* might start a new txn */
>>>>>>>>>       ext4_put_super
>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>         jbd2_journal_destroy
>>>>>>>>>           journal_kill_thread
>>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>>               jbd2_journal_bmap
>>>>>>>>>                 ext4_journal_bmap
>>>>>>>>>                   ext4_map_blocks
>>>>>>>>>                     ...
>>>>>>>>>                     ext4_inode_error
>>>>>>>>>                       ext4_handle_error
>>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>
>>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>>                                                update_super_work
>>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>>                                                    start_this_handle
>>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>>
>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>>
>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>> 2d01ddc86606).
>>>>>>>>>
>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>>
>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>>> ---
>>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>  	 */
>>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>>  
>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>> +
>>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>  {
>>>>>>>>>  	int err = 0;
>>>>>>>>>  
>>>>>>>>> +	/*
>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>> +	 * than going via journal.
>>>>>>>>> +	 */
>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Hi, Ojaswin!
>>>>>>>>
>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>> occur.
>>>>>>>>
>>>>>>>>  ext4_put_super()
>>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>                     **kjournald2**
>>>>>>>>                     jbd2_journal_commit_transaction()
>>>>>>>>                     ...
>>>>>>>>                     ext4_inode_error()
>>>>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>>>>                       schedule_work(s_sb_upd_work)
>>>>>>>>
>>>>>>>>                                   **workqueue**
>>>>>>>>                                    update_super_work
>>>>>>>>                                    /* s_journal_destorying is not set */
>>>>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>>>>
>>>>>>>>   ext4_journal_destroy()
>>>>>>>>    /* set s_journal_destorying */
>>>>>>>>    sbi->s_journal_destorying = true;
>>>>>>>>    jbd2_journal_destroy()
>>>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>
>>>>>>>>                                        jbd2_journal_start()
>>>>>>>>                                         start_this_handle()
>>>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yi.
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>>
>>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>>> helper. Just wondering if I should still let it be in
>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>>
>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>
>>>>>>> Regards,
>>>>>>> ojaswin
>>>>>>
>>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>> it?
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>  
>>>>>>                      **kjournald2**
>>>>>>                      jbd2_journal_commit_transaction()
>>>>>>                      ...
>>>>>>                      ext4_inode_error()
>>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>>                        schedule_work(s_sb_upd_work)
>>>>>>  
>>>>>>                                     **workqueue**
>>>>>>                                     update_super_work
>>>>>>                                     /* s_journal_destorying is not set */
>>>>>>                              	      if (journal && !s_journal_destorying)
>>>>>>  
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>                                       schedule_work()
>>>>>                                         ^^^^^^^^^^^^^^^
>>>>>                                         where does this come from?
>>>>>
>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>> not commit transaction or call schedule_work() again because it has
>>>>> been aborted due to the previous error. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>>
>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
>>>>
>>>>  if(journal && !is_journal_destroying) 
>>>>
>>>> is computed but schedule_work() not called yet, which is possible cause
>>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>>> is only called after we have done the flush then we end up with this:
>>>>
>>>>                               	      if (journal && !s_journal_destorying)
>>>>     ext4_journal_destroy()
>>>>      /* set s_journal_destorying */
>>>>      sbi->s_journal_destorying = true;
>>>>      flush_work(&sbi->s_sb_upd_work)
>>>>                                        schedule_work()
>>>>
>>>> Which is possible IMO, although the window is tiny.
>>>
>>> Yeah, right!
>>> Sorry for misread the location where you add the "!s_journal_destorying"
>>> check, the graph I provided was in update_super_work(), which was wrong.
>> 
>> Oh right, I also misread your trace but yes as discussed, even 
>> 
>>     sbi->s_journal_destorying = true;
>> 		flush_work()
>>     jbd2_journal_destroy()
>> 
>> doesn't work.
>> 
>>> The right one should be:
>>>
>>>  ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                     **kjournald2**
>>>                     jbd2_journal_commit_transaction()
>>>                     ...
>>>                     ext4_inode_error()
>>>                       /* s_journal_destorying is not set */
>>>                       if (journal && !s_journal_destorying)
>>>                         (schedule_work(s_sb_upd_work))  //can be here
>>>
>>>   ext4_journal_destroy()
>>>    /* set s_journal_destorying */
>>>    sbi->s_journal_destorying = true;
>>>    jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                         (schedule_work(s_sb_upd_work))  //also can be here
>>>
>>>                                   **workqueue**
>>>                                    update_super_work()
>>>                                    journal = sbi->s_journal //get journal
>>>     kfree(journal)
>>>                                      jbd2_journal_start(journal) //journal UAF
>>>                                        start_this_handle()
>>>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
>>>
>>>
>>> So there are two problems here, the first one is the 'journal' UAF,
>>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>> 
>> Indeed, there's a possible UAF here as well.
>> 
>>>
>>>>>>
>>>>>> As for the fix, how about we do something like this:
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>
>>>>>>    /* trigger a commit and wait for it to complete */
>>>>>>
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>     jbd2_journal_destroy()
>>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>>  
>>>>>>                                         jbd2_journal_start()
>>>>>>                                          start_this_handle()
>>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>>
>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>> be enough to fix the race. Thoughts on this?
>>>>>>
>>>
>>> I think this solution should work, the forced commit and flush_work()
>>> should ensure that the last transaction is committed and that the
>>> potential work is done.
>>>
>>> Besides, the s_journal_destorying flag is set and check concurrently
>>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>>> about adding a new flag into sbi->s_mount_state instead of adding
>>> new s_journal_destorying?
>> 
>> Right, that makes sence. I will incorporate these changes in the next 
>> revision.
>> 
>
> Think about this again, it seems that we no longer need the destroying
> flag. Because we force to commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work to complete. Regardless of whether it starts a new handle in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
>

I think the confusion maybe coming because v2 patch isn't where we
discussed to put the s_journal_destroying to true, in this thread [1]

[1]: https://lore.kernel.org/linux-ext4/jnxpphuradrsf73cxfmohfu7wwwckihtulw6ovsitddgt5pqkg@2uoejkr66qnl/


>  ext4_put_super()

   + sbi->s_journal_destroying = true;

We should add s_journal_destroying to true before calling for
flush_work. 

>   flush_work(&sbi->s_sb_upd_work)

After the above flush work is complete, we will always check if
s_journal_destroying is set. If yes, then we should never schedule the
sb update work

>   destroy_workqueue(sbi->rsv_conversion_wq)
>
>   ext4_journal_destroy()
>    /* trigger a commit (it will commit the last trnasaction) */
>
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                      ext4_inode_error()
>                       schedule_work(s_sb_upd_work))

Then this schedule work will never happen, since it will check if
s_journal_destroying flag is set. 

>
>                                      **workqueue**
>                                       update_super_work()
>                                         jbd2_journal_start(journal)
>                                           start_this_handle()
>                                           //This new trans will
>                                           //not be committed.
>
>                      jbd2_journal_abort()
>
>    /* wait for it to complete */
>
>    flush_work(&sbi->s_sb_upd_work)

No need to again call the flush work here, since there is no new work
which will be scheduled right.

Am I missing something?

-ritesh


>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
>    jbd2_journal_commit_transaction() //it will commit nothing
>
> Thanks,
> Yi.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08  9:58                     ` Ojaswin Mujoo
@ 2025-03-08 10:10                       ` Baokun Li
  2025-03-08 12:57                         ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Baokun Li @ 2025-03-08 10:10 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: Jan Kara, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On 2025/3/8 17:58, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
>> On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
>>> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>>>>> destroyed eg:
>>>>>>>>>>>
>>>>>>>>>>> (umount)
>>>>>>>>>>> ext4_kill_sb
>>>>>>>>>>>    kill_block_super
>>>>>>>>>>>      generic_shutdown_super
>>>>>>>>>>>        sync_filesystem /* commits all txns */
>>>>>>>>>>>        evict_inodes
>>>>>>>>>>>          /* might start a new txn */
>>>>>>>>>>>        ext4_put_super
>>>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>>>          jbd2_journal_destroy
>>>>>>>>>>>            journal_kill_thread
>>>>>>>>>>>              journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>>>            jbd2_journal_commit_transaction
>>>>>>>>>>>              jbd2_journal_get_descriptor_buffer
>>>>>>>>>>>                jbd2_journal_bmap
>>>>>>>>>>>                  ext4_journal_bmap
>>>>>>>>>>>                    ext4_map_blocks
>>>>>>>>>>>                      ...
>>>>>>>>>>>                      ext4_inode_error
>>>>>>>>>>>                        ext4_handle_error
>>>>>>>>>>>                          schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>>>
>>>>>>>>>>>                                                 /* work queue kicks in */
>>>>>>>>>>>                                                 update_super_work
>>>>>>>>>>>                                                   jbd2_journal_start
>>>>>>>>>>>                                                     start_this_handle
>>>>>>>>>>>                                                       BUG_ON(journal->j_flags &
>>>>>>>>>>>                                                              JBD2_UNMOUNT)
>>>>>>>>>>>
>>>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>>>>
>>>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>>>> 2d01ddc86606).
>>>>>>>>>>>
>>>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   fs/ext4/ext4.h      | 2 ++
>>>>>>>>>>>   fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>>>   fs/ext4/super.c     | 4 +++-
>>>>>>>>>>>   3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>>>   	 */
>>>>>>>>>>>   	struct work_struct s_sb_upd_work;
>>>>>>>>>>>   
>>>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>>>> +
>>>>>>>>>>>   	/* Atomic write unit values in bytes */
>>>>>>>>>>>   	unsigned int s_awu_min;
>>>>>>>>>>>   	unsigned int s_awu_max;
>>>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>>>   {
>>>>>>>>>>>   	int err = 0;
>>>>>>>>>>>   
>>>>>>>>>>> +	/*
>>>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>>>> +	 * than going via journal.
>>>>>>>>>>> +	 */
>>>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>>>> +
>>>>>>>>>> Hi, Ojaswin!
>>>>>>>>>>
>>>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>>>> occur.
>>>>>>>>>>
>>>>>>>>>>   ext4_put_super()
>>>>>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>>>>>
>>>>>>>>>>                      **kjournald2**
>>>>>>>>>>                      jbd2_journal_commit_transaction()
>>>>>>>>>>                      ...
>>>>>>>>>>                      ext4_inode_error()
>>>>>>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>>>>>>                        schedule_work(s_sb_upd_work)
>>>>>>>>>>
>>>>>>>>>>                                    **workqueue**
>>>>>>>>>>                                     update_super_work
>>>>>>>>>>                                     /* s_journal_destorying is not set */
>>>>>>>>>>                              	   if (journal && !s_journal_destorying)
>>>>>>>>>>
>>>>>>>>>>    ext4_journal_destroy()
>>>>>>>>>>     /* set s_journal_destorying */
>>>>>>>>>>     sbi->s_journal_destorying = true;
>>>>>>>>>>     jbd2_journal_destroy()
>>>>>>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>>
>>>>>>>>>>                                         jbd2_journal_start()
>>>>>>>>>>                                          start_this_handle()
>>>>>>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yi.
>>>>>>>>> Hi Yi,
>>>>>>>>>
>>>>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>>>>
>>>>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>>>>> helper. Just wondering if I should still let it be in
>>>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>>>>
>>>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> ojaswin
>>>>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>>>> it?
>>>>>>>>
>>>>>>>>    ext4_put_super()
>>>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>>>   
>>>>>>>>                       **kjournald2**
>>>>>>>>                       jbd2_journal_commit_transaction()
>>>>>>>>                       ...
>>>>>>>>                       ext4_inode_error()
>>>>>>>>                         /* JBD2_UNMOUNT not set */
>>>>>>>>                         schedule_work(s_sb_upd_work)
>>>>>>>>   
>>>>>>>>                                      **workqueue**
>>>>>>>>                                      update_super_work
>>>>>>>>                                      /* s_journal_destorying is not set */
>>>>>>>>                               	      if (journal && !s_journal_destorying)
>>>>>>>>   
>>>>>>>>     ext4_journal_destroy()
>>>>>>>>      /* set s_journal_destorying */
>>>>>>>>      sbi->s_journal_destorying = true;
>>>>>>>>      flush_work(&sbi->s_sb_upd_work)
>>>>>>>>                                        schedule_work()
>>>>>>>                                          ^^^^^^^^^^^^^^^
>>>>>>>                                          where does this come from?
>>>>>>>
>>>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>>>> not commit transaction or call schedule_work() again because it has
>>>>>>> been aborted due to the previous error. Am I missing something?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yi.
>>>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>>>>>>
>>>>>>   if(journal && !is_journal_destroying)
>>>>>>
>>>>>> is computed but schedule_work() not called yet, which is possible cause
>>>>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>>>>> is only called after we have done the flush then we end up with this:
>>>>>>
>>>>>>                                	      if (journal && !s_journal_destorying)
>>>>>>      ext4_journal_destroy()
>>>>>>       /* set s_journal_destorying */
>>>>>>       sbi->s_journal_destorying = true;
>>>>>>       flush_work(&sbi->s_sb_upd_work)
>>>>>>                                         schedule_work()
>>>>>>
>>>>>> Which is possible IMO, although the window is tiny.
>>>>> Yeah, right!
>>>>> Sorry for misread the location where you add the "!s_journal_destorying"
>>>>> check, the graph I provided was in update_super_work(), which was wrong.
>>>> Oh right, I also misread your trace but yes as discussed, even
>>>>
>>>>      sbi->s_journal_destorying = true;
>>>> 		flush_work()
>>>>      jbd2_journal_destroy()
>>>>
>>>> doesn't work.
>>>>
>>>>> The right one should be:
>>>>>
>>>>>   ext4_put_super()
>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>>                      **kjournald2**
>>>>>                      jbd2_journal_commit_transaction()
>>>>>                      ...
>>>>>                      ext4_inode_error()
>>>>>                        /* s_journal_destorying is not set */
>>>>>                        if (journal && !s_journal_destorying)
>>>>>                          (schedule_work(s_sb_upd_work))  //can be here
>>>>>
>>>>>    ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>     jbd2_journal_destroy()
>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>
>>>>>                          (schedule_work(s_sb_upd_work))  //also can be here
>>>>>
>>>>>                                    **workqueue**
>>>>>                                     update_super_work()
>>>>>                                     journal = sbi->s_journal //get journal
>>>>>      kfree(journal)
>>>>>                                       jbd2_journal_start(journal) //journal UAF
>>>>>                                         start_this_handle()
>>>>>                                           BUG_ON(JBD2_UNMOUNT) //bugon here
>>>>>
>>>>>
>>>>> So there are two problems here, the first one is the 'journal' UAF,
>>>>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>>>> Indeed, there's a possible UAF here as well.
>>>>
>>>>>>>> As for the fix, how about we do something like this:
>>>>>>>>
>>>>>>>>    ext4_put_super()
>>>>>>>>
>>>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>>>     destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>>>
>>>>>>>>     ext4_journal_destroy()
>>>>>>>>      /* set s_journal_destorying */
>>>>>>>>      sbi->s_journal_destorying = true;
>>>>>>>>
>>>>>>>>     /* trigger a commit and wait for it to complete */
>>>>>>>>
>>>>>>>>      flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>      jbd2_journal_destroy()
>>>>>>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>   
>>>>>>>>                                          jbd2_journal_start()
>>>>>>>>                                           start_this_handle()
>>>>>>>>                                             BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>>>> be enough to fix the race. Thoughts on this?
>>>>>>>>
>>>>> I think this solution should work, the forced commit and flush_work()
>>>>> should ensure that the last transaction is committed and that the
>>>>> potential work is done.
>>>>>
>>>>> Besides, the s_journal_destorying flag is set and check concurrently
>>>>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>>>>> about adding a new flag into sbi->s_mount_state instead of adding
>>>>> new s_journal_destorying?
>>>> Right, that makes sence. I will incorporate these changes in the next
>>>> revision.
>>>>
>>> Think about this again, it seems that we no longer need the destroying
>>> flag. Because we force to commit and wait for the **last** transaction to
>>> complete, and the flush work should also ensure that the last sb_update
>>> work to complete. Regardless of whether it starts a new handle in the
>>> last update_super_work(), it will not commit since the journal should
>>> have aborted. What are your thoughts?
>>>
>>>   ext4_put_super()
>>>    flush_work(&sbi->s_sb_upd_work)
>>>    destroy_workqueue(sbi->rsv_conversion_wq)
>>>
>>>    ext4_journal_destroy()
>>>     /* trigger a commit (it will commit the last trnasaction) */
>>>
>>>                      **kjournald2**
>>>                      jbd2_journal_commit_transaction()
>>>                      ...
>>>                       ext4_inode_error()
>>>                        schedule_work(s_sb_upd_work))
>>>
>>>                                       **workqueue**
>>>                                        update_super_work()
>>>                                          jbd2_journal_start(journal)
>>>                                            start_this_handle()
>>>                                            //This new trans will
>>>                                            //not be committed.
>>>
>>>                       jbd2_journal_abort()
>>>
>>>     /* wait for it to complete */
>>>
>>>     flush_work(&sbi->s_sb_upd_work)
>>>     jbd2_journal_destroy()
>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>     jbd2_journal_commit_transaction() //it will commit nothing
>>>
>>> Thanks,
>>> Yi.
>> Hi Yi,
>>
>> There's one more path for which we need the flag:
>>
>>     ext4_journal_destroy()
>>      /* trigger a commit (it will commit the last trnasaction) */
>>   
>>                       **kjournald2**
>>                       jbd2_journal_commit_transaction()
>> 										 	journal->j_commit_callback()
>> 											  ext4_journal_commit_callback()
>> 												  ext4_maybe_update_superblock()
>> 													  schedule_work()
>> 			/* start a transaction here */
>> 			flush_work()
>> 			  jbd2_journal_destroy()
>> 				  journal_kill_thread
>> 					  flags |= JBD2_UNMOUNT
>> 				  jbd2_journal_commit_transaction()
>> 					  ...
>>              ext4_inode_error()
>> 							schedule_work(s_sb_upd_work))
>> 							/* update_super_work_tries to start the txn */
>> 							BUG_ON()
> Oops the formatting is wrong, here's the trace:
>
>   ext4_journal_destroy()
>      /* trigger a commit (it will commit the last trnasaction) */
>
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                      journal->j_commit_callback()
>                        ext4_journal_commit_callback()
>                          ext4_maybe_update_superblock()
>                            schedule_work()
At this point, SB_ACTIVE should have been cleared,
so ext4_maybe_update_superblock() should do nothing.

With this in mind, it could be the case that an
additional flag is no longer needed.


Regards,
Baokun
>
>      /* update_super_work starts a new txn here */
>      flush_work()
>      jbd2_journal_destroy()
>        journal_kill_thread
>          flags |= JBD2_UNMOUNT
>        jbd2_journal_commit_transaction()
>          ...
>          ext4_inode_error()
>            schedule_work(s_sb_upd_work))
>            /* update_super_work_tries to start the txn */
>            BUG_ON()
>
>> I think this to protect against this path we do need a flag.
>>
>> Regards,
>> ojaswin



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08 10:10                       ` Baokun Li
@ 2025-03-08 12:57                         ` Ojaswin Mujoo
  0 siblings, 0 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08 12:57 UTC (permalink / raw)
  To: Baokun Li
  Cc: Zhang Yi, Jan Kara, linux-kernel, Mahesh Kumar, linux-ext4,
	Theodore Ts'o

On Sat, Mar 08, 2025 at 06:10:14PM +0800, Baokun Li wrote:
> On 2025/3/8 17:58, Ojaswin Mujoo wrote:
> > On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> > > On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > > > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > > > > > On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > > > > > > On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > > > > > > > On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > > > > > > > > On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > > > > > > > > > On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > > > > > > > > > > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > > > > > > > > > > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > > > > > > > > > > > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > > > > > > > > > > > stress tests, it was observed that in case of some error handling paths, it is
> > > > > > > > > > > > possible for update_super_work to start a transaction after the journal is
> > > > > > > > > > > > destroyed eg:
> > > > > > > > > > > > 
> > > > > > > > > > > > (umount)
> > > > > > > > > > > > ext4_kill_sb
> > > > > > > > > > > >    kill_block_super
> > > > > > > > > > > >      generic_shutdown_super
> > > > > > > > > > > >        sync_filesystem /* commits all txns */
> > > > > > > > > > > >        evict_inodes
> > > > > > > > > > > >          /* might start a new txn */
> > > > > > > > > > > >        ext4_put_super
> > > > > > > > > > > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > > > > > > > > >          jbd2_journal_destroy
> > > > > > > > > > > >            journal_kill_thread
> > > > > > > > > > > >              journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > > >            jbd2_journal_commit_transaction
> > > > > > > > > > > >              jbd2_journal_get_descriptor_buffer
> > > > > > > > > > > >                jbd2_journal_bmap
> > > > > > > > > > > >                  ext4_journal_bmap
> > > > > > > > > > > >                    ext4_map_blocks
> > > > > > > > > > > >                      ...
> > > > > > > > > > > >                      ext4_inode_error
> > > > > > > > > > > >                        ext4_handle_error
> > > > > > > > > > > >                          schedule_work(&sbi->s_sb_upd_work)
> > > > > > > > > > > > 
> > > > > > > > > > > >                                                 /* work queue kicks in */
> > > > > > > > > > > >                                                 update_super_work
> > > > > > > > > > > >                                                   jbd2_journal_start
> > > > > > > > > > > >                                                     start_this_handle
> > > > > > > > > > > >                                                       BUG_ON(journal->j_flags &
> > > > > > > > > > > >                                                              JBD2_UNMOUNT)
> > > > > > > > > > > > 
> > > > > > > > > > > > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > > > > > > > > > > > destroying only do a journaled (and deferred) update of sb if this flag is not
> > > > > > > > > > > > set. Otherwise, just fallback to an un-journaled commit.
> > > > > > > > > > > > 
> > > > > > > > > > > > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > > > > > > > > > > > during ext4_put_super() (except a running transaction that will get commited
> > > > > > > > > > > > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > > > > > > > > > > > outside the journal as it won't race with a journaled update (refer
> > > > > > > > > > > > 2d01ddc86606).
> > > > > > > > > > > > 
> > > > > > > > > > > > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > > > > > > > > > > > called from mballoc and AFAICT it would be always valid to schedule work here.
> > > > > > > > > > > > 
> > > > > > > > > > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > > > > > > > > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > > > > > > > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > > > > > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >   fs/ext4/ext4.h      | 2 ++
> > > > > > > > > > > >   fs/ext4/ext4_jbd2.h | 8 ++++++++
> > > > > > > > > > > >   fs/ext4/super.c     | 4 +++-
> > > > > > > > > > > >   3 files changed, 13 insertions(+), 1 deletion(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > > > > > > > index 2b7d781bfcad..d48e93bd5690 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4.h
> > > > > > > > > > > > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > > > > > > > > > > >   	 */
> > > > > > > > > > > >   	struct work_struct s_sb_upd_work;
> > > > > > > > > > > > +	bool s_journal_destorying;
> > > > > > > > > > > > +
> > > > > > > > > > > >   	/* Atomic write unit values in bytes */
> > > > > > > > > > > >   	unsigned int s_awu_min;
> > > > > > > > > > > >   	unsigned int s_awu_max;
> > > > > > > > > > > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > index 9b3c9df02a39..6bd3ca84410d 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > > > > > > > > > > >   {
> > > > > > > > > > > >   	int err = 0;
> > > > > > > > > > > > +	/*
> > > > > > > > > > > > +	 * At this point all pending FS updates should be done except a possible
> > > > > > > > > > > > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > > > > > > > > > > > +	 * is now safe for any new errors to directly commit superblock rather
> > > > > > > > > > > > +	 * than going via journal.
> > > > > > > > > > > > +	 */
> > > > > > > > > > > > +	sbi->s_journal_destorying = true;
> > > > > > > > > > > > +
> > > > > > > > > > > Hi, Ojaswin!
> > > > > > > > > > > 
> > > > > > > > > > > I'm afraid you still need to flush the superblock update work here,
> > > > > > > > > > > otherwise I guess the race condition you mentioned in v1 could still
> > > > > > > > > > > occur.
> > > > > > > > > > > 
> > > > > > > > > > >   ext4_put_super()
> > > > > > > > > > >    flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > > > 
> > > > > > > > > > >                      **kjournald2**
> > > > > > > > > > >                      jbd2_journal_commit_transaction()
> > > > > > > > > > >                      ...
> > > > > > > > > > >                      ext4_inode_error()
> > > > > > > > > > >                        /* JBD2_UNMOUNT not set */
> > > > > > > > > > >                        schedule_work(s_sb_upd_work)
> > > > > > > > > > > 
> > > > > > > > > > >                                    **workqueue**
> > > > > > > > > > >                                     update_super_work
> > > > > > > > > > >                                     /* s_journal_destorying is not set */
> > > > > > > > > > >                              	   if (journal && !s_journal_destorying)
> > > > > > > > > > > 
> > > > > > > > > > >    ext4_journal_destroy()
> > > > > > > > > > >     /* set s_journal_destorying */
> > > > > > > > > > >     sbi->s_journal_destorying = true;
> > > > > > > > > > >     jbd2_journal_destroy()
> > > > > > > > > > >      journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > > 
> > > > > > > > > > >                                         jbd2_journal_start()
> > > > > > > > > > >                                          start_this_handle()
> > > > > > > > > > >                                            BUG_ON(JBD2_UNMOUNT)
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Yi.
> > > > > > > > > > Hi Yi,
> > > > > > > > > > 
> > > > > > > > > > Yes you are right, somehow missed this edge case :(
> > > > > > > > > > 
> > > > > > > > > > Alright then, we have to move out sbi->s_journal_destroying outside the
> > > > > > > > > > helper. Just wondering if I should still let it be in
> > > > > > > > > > ext4_journal_destroy and just add an extra s_journal_destroying = false
> > > > > > > > > > before schedule_work(s_sb_upd_work), because it makes sense.
> > > > > > > > > > 
> > > > > > > > > > Okay let me give it some thought but thanks for pointing this out!
> > > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > ojaswin
> > > > > > > > > Okay so thinking about it a bit more, I see you also suggested to flush
> > > > > > > > > the work after marking sbi->s_journal_destroying. But will that solve
> > > > > > > > > it?
> > > > > > > > > 
> > > > > > > > >    ext4_put_super()
> > > > > > > > >     flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >                       **kjournald2**
> > > > > > > > >                       jbd2_journal_commit_transaction()
> > > > > > > > >                       ...
> > > > > > > > >                       ext4_inode_error()
> > > > > > > > >                         /* JBD2_UNMOUNT not set */
> > > > > > > > >                         schedule_work(s_sb_upd_work)
> > > > > > > > >                                      **workqueue**
> > > > > > > > >                                      update_super_work
> > > > > > > > >                                      /* s_journal_destorying is not set */
> > > > > > > > >                               	      if (journal && !s_journal_destorying)
> > > > > > > > >     ext4_journal_destroy()
> > > > > > > > >      /* set s_journal_destorying */
> > > > > > > > >      sbi->s_journal_destorying = true;
> > > > > > > > >      flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >                                        schedule_work()
> > > > > > > >                                          ^^^^^^^^^^^^^^^
> > > > > > > >                                          where does this come from?
> > > > > > > > 
> > > > > > > > After this flush_work, we can guarantee that the running s_sb_upd_work
> > > > > > > > finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > > > > > > > not commit transaction or call schedule_work() again because it has
> > > > > > > > been aborted due to the previous error. Am I missing something?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Yi.
> > > > > > > Hmm, so I am thinking of a corner case in ext4_handle_error() where
> > > > > > > 
> > > > > > >   if(journal && !is_journal_destroying)
> > > > > > > 
> > > > > > > is computed but schedule_work() not called yet, which is possible cause
> > > > > > > the cmp followed by jump is not atomic in nature. If the schedule_work
> > > > > > > is only called after we have done the flush then we end up with this:
> > > > > > > 
> > > > > > >                                	      if (journal && !s_journal_destorying)
> > > > > > >      ext4_journal_destroy()
> > > > > > >       /* set s_journal_destorying */
> > > > > > >       sbi->s_journal_destorying = true;
> > > > > > >       flush_work(&sbi->s_sb_upd_work)
> > > > > > >                                         schedule_work()
> > > > > > > 
> > > > > > > Which is possible IMO, although the window is tiny.
> > > > > > Yeah, right!
> > > > > > Sorry for misread the location where you add the "!s_journal_destorying"
> > > > > > check, the graph I provided was in update_super_work(), which was wrong.
> > > > > Oh right, I also misread your trace but yes as discussed, even
> > > > > 
> > > > >      sbi->s_journal_destorying = true;
> > > > > 		flush_work()
> > > > >      jbd2_journal_destroy()
> > > > > 
> > > > > doesn't work.
> > > > > 
> > > > > > The right one should be:
> > > > > > 
> > > > > >   ext4_put_super()
> > > > > >    flush_work(&sbi->s_sb_upd_work)
> > > > > > 
> > > > > >                      **kjournald2**
> > > > > >                      jbd2_journal_commit_transaction()
> > > > > >                      ...
> > > > > >                      ext4_inode_error()
> > > > > >                        /* s_journal_destorying is not set */
> > > > > >                        if (journal && !s_journal_destorying)
> > > > > >                          (schedule_work(s_sb_upd_work))  //can be here
> > > > > > 
> > > > > >    ext4_journal_destroy()
> > > > > >     /* set s_journal_destorying */
> > > > > >     sbi->s_journal_destorying = true;
> > > > > >     jbd2_journal_destroy()
> > > > > >      journal->j_flags |= JBD2_UNMOUNT;
> > > > > > 
> > > > > >                          (schedule_work(s_sb_upd_work))  //also can be here
> > > > > > 
> > > > > >                                    **workqueue**
> > > > > >                                     update_super_work()
> > > > > >                                     journal = sbi->s_journal //get journal
> > > > > >      kfree(journal)
> > > > > >                                       jbd2_journal_start(journal) //journal UAF
> > > > > >                                         start_this_handle()
> > > > > >                                           BUG_ON(JBD2_UNMOUNT) //bugon here
> > > > > > 
> > > > > > 
> > > > > > So there are two problems here, the first one is the 'journal' UAF,
> > > > > > the second one is triggering JBD2_UNMOUNT flag BUGON.
> > > > > Indeed, there's a possible UAF here as well.
> > > > > 
> > > > > > > > > As for the fix, how about we do something like this:
> > > > > > > > > 
> > > > > > > > >    ext4_put_super()
> > > > > > > > > 
> > > > > > > > >     flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >     destroy_workqueue(sbi->rsv_conversion_wq);
> > > > > > > > > 
> > > > > > > > >     ext4_journal_destroy()
> > > > > > > > >      /* set s_journal_destorying */
> > > > > > > > >      sbi->s_journal_destorying = true;
> > > > > > > > > 
> > > > > > > > >     /* trigger a commit and wait for it to complete */
> > > > > > > > > 
> > > > > > > > >      flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > 
> > > > > > > > >      jbd2_journal_destroy()
> > > > > > > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > >                                          jbd2_journal_start()
> > > > > > > > >                                           start_this_handle()
> > > > > > > > >                                             BUG_ON(JBD2_UNMOUNT)
> > > > > > > > > 
> > > > > > > > > Still giving this codepath some thought but seems like this might just
> > > > > > > > > be enough to fix the race. Thoughts on this?
> > > > > > > > > 
> > > > > > I think this solution should work, the forced commit and flush_work()
> > > > > > should ensure that the last transaction is committed and that the
> > > > > > potential work is done.
> > > > > > 
> > > > > > Besides, the s_journal_destorying flag is set and check concurrently
> > > > > > now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> > > > > > about adding a new flag into sbi->s_mount_state instead of adding
> > > > > > new s_journal_destorying?
> > > > > Right, that makes sence. I will incorporate these changes in the next
> > > > > revision.
> > > > > 
> > > > Think about this again, it seems that we no longer need the destroying
> > > > flag. Because we force to commit and wait for the **last** transaction to
> > > > complete, and the flush work should also ensure that the last sb_update
> > > > work to complete. Regardless of whether it starts a new handle in the
> > > > last update_super_work(), it will not commit since the journal should
> > > > have aborted. What are your thoughts?
> > > > 
> > > >   ext4_put_super()
> > > >    flush_work(&sbi->s_sb_upd_work)
> > > >    destroy_workqueue(sbi->rsv_conversion_wq)
> > > > 
> > > >    ext4_journal_destroy()
> > > >     /* trigger a commit (it will commit the last trnasaction) */
> > > > 
> > > >                      **kjournald2**
> > > >                      jbd2_journal_commit_transaction()
> > > >                      ...
> > > >                       ext4_inode_error()
> > > >                        schedule_work(s_sb_upd_work))
> > > > 
> > > >                                       **workqueue**
> > > >                                        update_super_work()
> > > >                                          jbd2_journal_start(journal)
> > > >                                            start_this_handle()
> > > >                                            //This new trans will
> > > >                                            //not be committed.
> > > > 
> > > >                       jbd2_journal_abort()
> > > > 
> > > >     /* wait for it to complete */
> > > > 
> > > >     flush_work(&sbi->s_sb_upd_work)
> > > >     jbd2_journal_destroy()
> > > >      journal->j_flags |= JBD2_UNMOUNT;
> > > >     jbd2_journal_commit_transaction() //it will commit nothing
> > > > 
> > > > Thanks,
> > > > Yi.
> > > Hi Yi,
> > > 
> > > There's one more path for which we need the flag:
> > > 
> > >     ext4_journal_destroy()
> > >      /* trigger a commit (it will commit the last trnasaction) */
> > >                       **kjournald2**
> > >                       jbd2_journal_commit_transaction()
> > > 										 	journal->j_commit_callback()
> > > 											  ext4_journal_commit_callback()
> > > 												  ext4_maybe_update_superblock()
> > > 													  schedule_work()
> > > 			/* start a transaction here */
> > > 			flush_work()
> > > 			  jbd2_journal_destroy()
> > > 				  journal_kill_thread
> > > 					  flags |= JBD2_UNMOUNT
> > > 				  jbd2_journal_commit_transaction()
> > > 					  ...
> > >              ext4_inode_error()
> > > 							schedule_work(s_sb_upd_work))
> > > 							/* update_super_work_tries to start the txn */
> > > 							BUG_ON()
> > Oops the formatting is wrong, here's the trace:
> > 
> >   ext4_journal_destroy()
> >      /* trigger a commit (it will commit the last trnasaction) */
> > 
> >                     **kjournald2**
> >                     jbd2_journal_commit_transaction()
> >                      journal->j_commit_callback()
> >                        ext4_journal_commit_callback()
> >                          ext4_maybe_update_superblock()
> >                            schedule_work()
> At this point, SB_ACTIVE should have been cleared,
> so ext4_maybe_update_superblock() should do nothing.
> 
> With this in mind, it could be the case that an
> additional flag is no longer needed.

Yes got it now, all the backtraces are confusing me haha

Alright then, I feel we can take 2 approaches now. 

- Without the flag, (flushing sb update wq 2 times):

This approach looks like how Yi has described here [1] however
this relies on the fact that the last update_super_work is only
called in journal is aborted and hence will never start a txn.

This is possible because we flush the sb 2 times:

  ext4_put_super()
    flush_work(s_sb_upd_work) // 1st flush, clears all pending updates
    ext4_journal_destroy
      /* commit & wait for txn */
      flush_work(s_sb_upd_work) // 2nd flush, only has work if journal is aborted */

The first flush flushes all pending updates and 2nd one is only invoked
when the journal commit faces an error and hence never starts a new
txn. Thus everything works.

[1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#m3de5dcae6afa979d01281f64b0c088131e72fc92

- With the flag. (single sb flush).

This is an alternate approach where we remove the flush_work() from
ext4_put_super() and only keep 1 in ext4_journal_destroy(). With this
we need to explicitly rely on a sbi/mount flag because the flush can
actually start a txn. WITHOUT the flag we can run into this:

                         **kjournald2**
                         jbd2_journal_commit_transaction()
                          journal->j_commit_callback()
                            ext4_journal_commit_callback()
                              ext4_maybe_update_superblock()
                                schedule_work()

  ext4_put_super()
    ext4_journal_destroy
      /* commit & wait for txn */
      flush_work(s_sb_upd_work) // This will have sb updates even if journal not aborted
        /* update_super_work starts a txn */
      jbd2_destroy_journal
        flags |= JBD2_UNMOUNT
        jbd2_journal_commit_transaction
          /* error */
          ext4_handle_error
            scheduled_work()
              /*update super work hits BUG_ON */

Hence the flag is needed in this approach.

I actually prefer the 2nd approach with the flag because it seems easier
to maintain in the long run than relying on the fragile
flush-commit-flush sequence, which might get silently broken as new code
is added.

I hope I didn't miss something this time. Thoughts? :) 

Regards,
ojaswin
> 
> 
> Regards,
> Baokun
> > 
> >      /* update_super_work starts a new txn here */
> >      flush_work()
> >      jbd2_journal_destroy()
> >        journal_kill_thread
> >          flags |= JBD2_UNMOUNT
> >        jbd2_journal_commit_transaction()
> >          ...
> >          ext4_inode_error()
> >            schedule_work(s_sb_upd_work))
> >            /* update_super_work_tries to start the txn */
> >            BUG_ON()
> > 
> > > I think this to protect against this path we do need a flag.
> > > 
> > > Regards,
> > > ojaswin
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08  9:55   ` Ritesh Harjani (IBM)
@ 2025-03-08 13:05     ` Ojaswin Mujoo
  2025-03-08 13:26       ` Ritesh Harjani
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08 13:05 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> 
> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > stress tests, it was observed that in case of some error handling paths, it is
> > possible for update_super_work to start a transaction after the journal is
> > destroyed eg:
> >
> > (umount)
> > ext4_kill_sb
> >   kill_block_super
> >     generic_shutdown_super
> >       sync_filesystem /* commits all txns */
> >       evict_inodes
> >         /* might start a new txn */
> >       ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >         jbd2_journal_destroy
> >           journal_kill_thread
> >             journal->j_flags |= JBD2_UNMOUNT;
> >           jbd2_journal_commit_transaction
> >             jbd2_journal_get_descriptor_buffer
> >               jbd2_journal_bmap
> >                 ext4_journal_bmap
> >                   ext4_map_blocks
> >                     ...
> >                     ext4_inode_error
> >                       ext4_handle_error
> >                         schedule_work(&sbi->s_sb_upd_work)
> >
> >                                                /* work queue kicks in */
> >                                                update_super_work
> >                                                  jbd2_journal_start
> >                                                    start_this_handle
> >                                                      BUG_ON(journal->j_flags &
> >                                                             JBD2_UNMOUNT)
> >
> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > destroying only do a journaled (and deferred) update of sb if this flag is not
> > set. Otherwise, just fallback to an un-journaled commit.
> >
> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > during ext4_put_super() (except a running transaction that will get commited
> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > outside the journal as it won't race with a journaled update (refer
> > 2d01ddc86606).
> >
> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > called from mballoc and AFAICT it would be always valid to schedule work here.
> >
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >  fs/ext4/ext4.h      | 2 ++
> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >  fs/ext4/super.c     | 4 +++-
> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 2b7d781bfcad..d48e93bd5690 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >  	 */
> >  	struct work_struct s_sb_upd_work;
> >  
> > +	bool s_journal_destorying;
> > +
> >  	/* Atomic write unit values in bytes */
> >  	unsigned int s_awu_min;
> >  	unsigned int s_awu_max;
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 9b3c9df02a39..6bd3ca84410d 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >  {
> >  	int err = 0;
> >  
> > +	/*
> > +	 * At this point all pending FS updates should be done except a possible
> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > +	 * is now safe for any new errors to directly commit superblock rather
> > +	 * than going via journal.
> > +	 */
> > +	sbi->s_journal_destorying = true;
> 
> This is not correct right. I think what we decided to set this flag
> before we flush the workqueue. So that we don't schedule any new
> work after this flag has been set. At least that is what I understood.
> 
> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> 
> -ritesh

Hey Ritesh,

Yes that is not correct, I missed that in my patch however we realised
that adding it before flush_work() also has issues [1]. More
specifically:

                     **kjournald2**
                     jbd2_journal_commit_transaction()
                     ...
                     ext4_handle_error()
                        /* s_journal_destorying is not set */
                        if (journal && !s_journal_destorying)
  ext4_put_super()
    sbi->s_journal_destorying = true;
    flush_work(&sbi->s_sb_upd_work)
                                      schedule_work()
    jbd2_journal_destroy()
     journal->j_flags |= JBD2_UNMOUNT;

                                        jbd2_journal_start()
                                         start_this_handle()
                                           BUG_ON(JBD2_UNMOUNT)

So the right thing to do seems to be that we need to force a journal
commit before the final flush as well. [1] Has more info on this and
some followup discussion as well.

[1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e

Regards,
ojaswin
> 
> 
> > +
> >  	err = jbd2_journal_destroy(journal);
> >  	sbi->s_journal = NULL;
> >  
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 8ad664d47806..31552cf0519a 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >  		 * constraints, it may not be safe to do it right here so we
> >  		 * defer superblock flushing to a workqueue.
> >  		 */
> > -		if (continue_fs && journal)
> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >  		else
> >  			ext4_commit_super(sb);
> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >  	spin_lock_init(&sbi->s_error_lock);
> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >  
> > +	sbi->s_journal_destorying = false;
> > +
> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >  	if (err)
> >  		goto failed_mount3;
> > -- 
> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
  2025-03-08  1:09 ` [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Baokun Li
@ 2025-03-08 13:07   ` Ojaswin Mujoo
  2025-03-08 15:21     ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08 13:07 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On Sat, Mar 08, 2025 at 09:09:28AM +0800, Baokun Li wrote:
> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > ** Changes since v1 [1] **
> > 
> >   * Picked up RVBs from Jan and Ritesh
> >   * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
> >     to determine when to journal sb vs when to commit directly.
> >   * Added a prep patch 1/3
> > 
> > [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
> > 
> > @Baokun, I didn't get a chance to look into the journal_inode
> > modifications we were discussing in [2]. I'll try to spend some time and
> > send that as a separate patch. Hope that's okay
> > 
> > [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
> That's fine, it's not a priority. And if this patch set makes sure we
> don't crash when things go wrong, I'm okay with leaving it as is.
> 
> It's possible that jbd2_journal_commit_transaction() could call
> ext4_handle_error() in other places as the code evolves. Fixing known
> problems and protecting against potential ones is always a good thing.

Yep thats true, I did spend some time on this since the codepath was a
bit unfamiliar to me. Seems like a straighforward enough change. I'll
add it to the next patch.

thanks,
ojaswin
> 
> 
> Cheers,
> Baokun
> > ** Original Cover **
> > 
> > When running LTP stress tests on ext4, after a multiday run we seemed to
> > have hit the following BUG_ON:
> > 
> >   [NIP  : start_this_handle+268]
> >   #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
> >   #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
> >   #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
> >   #6 [c000001067c27c70] process_one_work at c000000000188790
> >   #7 [c000001067c27d20] worker_thread at c00000000018973c
> >   #8 [c000001067c27dc0] kthread at c000000000196c84
> >   #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
> > 
> > Which comes out to
> > 
> >    382   repeat:
> >    383           read_lock(&journal->j_state_lock);
> > * 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
> >    385           if (is_journal_aborted(journal) ||
> >    386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
> >    387                   read_unlock(&journal->j_state_lock);
> > 
> > 
> > Initially this seemed like it should never happen but upon crash
> > analysis it seems like it could indeed be hit as described in patch 1/2.
> > 
> > I would like to add that through the logs we only knew that:
> > 
> > - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
> > - update_super_work had hit the BUG_ON
> > 
> > I was not able to hit this bug again (without modifying the kernel to
> > inject errors) but the above backtrace seems to be one possible paths
> > where this BUG_ON can be hit. Rest of the analysis and fix is in patch
> > 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
> > 
> > That being said, journalling is something I'm not very familiar with and
> > there might be gaps in my understanding so thoughts and suggestions are
> > welcome.
> > 
> > Ojaswin Mujoo (3):
> >    ext4: define ext4_journal_destroy wrapper
> >    ext4: avoid journaling sb update on error if journal is destroying
> >    ext4: Make sb update interval tunable
> > 
> >   fs/ext4/ext4.h      | 11 +++++++++++
> >   fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
> >   fs/ext4/super.c     | 35 +++++++++++++++++------------------
> >   fs/ext4/sysfs.c     |  4 ++++
> >   4 files changed, 54 insertions(+), 18 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08 13:05     ` Ojaswin Mujoo
@ 2025-03-08 13:26       ` Ritesh Harjani
  2025-03-08 14:58         ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Ritesh Harjani @ 2025-03-08 13:26 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> 
>> > Presently we always BUG_ON if trying to start a transaction on a journal marked
>> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>> > stress tests, it was observed that in case of some error handling paths, it is
>> > possible for update_super_work to start a transaction after the journal is
>> > destroyed eg:
>> >
>> > (umount)
>> > ext4_kill_sb
>> >   kill_block_super
>> >     generic_shutdown_super
>> >       sync_filesystem /* commits all txns */
>> >       evict_inodes
>> >         /* might start a new txn */
>> >       ext4_put_super
>> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>> >         jbd2_journal_destroy
>> >           journal_kill_thread
>> >             journal->j_flags |= JBD2_UNMOUNT;
>> >           jbd2_journal_commit_transaction
>> >             jbd2_journal_get_descriptor_buffer
>> >               jbd2_journal_bmap
>> >                 ext4_journal_bmap
>> >                   ext4_map_blocks
>> >                     ...
>> >                     ext4_inode_error
>> >                       ext4_handle_error
>> >                         schedule_work(&sbi->s_sb_upd_work)
>> >
>> >                                                /* work queue kicks in */
>> >                                                update_super_work
>> >                                                  jbd2_journal_start
>> >                                                    start_this_handle
>> >                                                      BUG_ON(journal->j_flags &
>> >                                                             JBD2_UNMOUNT)
>> >
>> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>> > destroying only do a journaled (and deferred) update of sb if this flag is not
>> > set. Otherwise, just fallback to an un-journaled commit.
>> >
>> > We set sbi->s_journal_destroying = true only after all the FS updates are done
>> > during ext4_put_super() (except a running transaction that will get commited
>> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>> > outside the journal as it won't race with a journaled update (refer
>> > 2d01ddc86606).
>> >
>> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
>> > called from mballoc and AFAICT it would be always valid to schedule work here.
>> >
>> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>> > Suggested-by: Jan Kara <jack@suse.cz>
>> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> > ---
>> >  fs/ext4/ext4.h      | 2 ++
>> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
>> >  fs/ext4/super.c     | 4 +++-
>> >  3 files changed, 13 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> > index 2b7d781bfcad..d48e93bd5690 100644
>> > --- a/fs/ext4/ext4.h
>> > +++ b/fs/ext4/ext4.h
>> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>> >  	 */
>> >  	struct work_struct s_sb_upd_work;
>> >  
>> > +	bool s_journal_destorying;
>> > +
>> >  	/* Atomic write unit values in bytes */
>> >  	unsigned int s_awu_min;
>> >  	unsigned int s_awu_max;
>> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> > index 9b3c9df02a39..6bd3ca84410d 100644
>> > --- a/fs/ext4/ext4_jbd2.h
>> > +++ b/fs/ext4/ext4_jbd2.h
>> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>> >  {
>> >  	int err = 0;
>> >  
>> > +	/*
>> > +	 * At this point all pending FS updates should be done except a possible
>> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
>> > +	 * is now safe for any new errors to directly commit superblock rather
>> > +	 * than going via journal.
>> > +	 */
>> > +	sbi->s_journal_destorying = true;
>> 
>> This is not correct right. I think what we decided to set this flag
>> before we flush the workqueue. So that we don't schedule any new
>> work after this flag has been set. At least that is what I understood.
>> 
>> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
>> 
>> -ritesh
>
> Hey Ritesh,
>
> Yes that is not correct, I missed that in my patch however we realised
> that adding it before flush_work() also has issues [1]. More
> specifically:

Ohk. right. 

>
>                      **kjournald2**
>                      jbd2_journal_commit_transaction()
>                      ...
>                      ext4_handle_error()
>                         /* s_journal_destorying is not set */
>                         if (journal && !s_journal_destorying)

Then maybe we should not schedule another work to update the superblock
via journalling, it the error itself occurred while were trying to
commit the journal txn? 


-ritesh


>   ext4_put_super()
>     sbi->s_journal_destorying = true;
>     flush_work(&sbi->s_sb_upd_work)
>                                       schedule_work()
>     jbd2_journal_destroy()
>      journal->j_flags |= JBD2_UNMOUNT;
>
>                                         jbd2_journal_start()
>                                          start_this_handle()
>                                            BUG_ON(JBD2_UNMOUNT)
>
> So the right thing to do seems to be that we need to force a journal
> commit before the final flush as well. [1] Has more info on this and
> some followup discussion as well.
>
> [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e
>
> Regards,
> ojaswin
>> 
>> 
>> > +
>> >  	err = jbd2_journal_destroy(journal);
>> >  	sbi->s_journal = NULL;
>> >  
>> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> > index 8ad664d47806..31552cf0519a 100644
>> > --- a/fs/ext4/super.c
>> > +++ b/fs/ext4/super.c
>> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>> >  		 * constraints, it may not be safe to do it right here so we
>> >  		 * defer superblock flushing to a workqueue.
>> >  		 */
>> > -		if (continue_fs && journal)
>> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>> >  		else
>> >  			ext4_commit_super(sb);
>> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>> >  	spin_lock_init(&sbi->s_error_lock);
>> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>> >  
>> > +	sbi->s_journal_destorying = false;
>> > +
>> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>> >  	if (err)
>> >  		goto failed_mount3;
>> > -- 
>> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08 13:26       ` Ritesh Harjani
@ 2025-03-08 14:58         ` Ojaswin Mujoo
  2025-03-08 18:41           ` Ritesh Harjani
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08 14:58 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> 
> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> 
> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >> > stress tests, it was observed that in case of some error handling paths, it is
> >> > possible for update_super_work to start a transaction after the journal is
> >> > destroyed eg:
> >> >
> >> > (umount)
> >> > ext4_kill_sb
> >> >   kill_block_super
> >> >     generic_shutdown_super
> >> >       sync_filesystem /* commits all txns */
> >> >       evict_inodes
> >> >         /* might start a new txn */
> >> >       ext4_put_super
> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >> >         jbd2_journal_destroy
> >> >           journal_kill_thread
> >> >             journal->j_flags |= JBD2_UNMOUNT;
> >> >           jbd2_journal_commit_transaction
> >> >             jbd2_journal_get_descriptor_buffer
> >> >               jbd2_journal_bmap
> >> >                 ext4_journal_bmap
> >> >                   ext4_map_blocks
> >> >                     ...
> >> >                     ext4_inode_error
> >> >                       ext4_handle_error
> >> >                         schedule_work(&sbi->s_sb_upd_work)
> >> >
> >> >                                                /* work queue kicks in */
> >> >                                                update_super_work
> >> >                                                  jbd2_journal_start
> >> >                                                    start_this_handle
> >> >                                                      BUG_ON(journal->j_flags &
> >> >                                                             JBD2_UNMOUNT)
> >> >
> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
> >> > set. Otherwise, just fallback to an un-journaled commit.
> >> >
> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> >> > during ext4_put_super() (except a running transaction that will get commited
> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >> > outside the journal as it won't race with a journaled update (refer
> >> > 2d01ddc86606).
> >> >
> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
> >> >
> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >> > Suggested-by: Jan Kara <jack@suse.cz>
> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> > ---
> >> >  fs/ext4/ext4.h      | 2 ++
> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >> >  fs/ext4/super.c     | 4 +++-
> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> > index 2b7d781bfcad..d48e93bd5690 100644
> >> > --- a/fs/ext4/ext4.h
> >> > +++ b/fs/ext4/ext4.h
> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >> >  	 */
> >> >  	struct work_struct s_sb_upd_work;
> >> >  
> >> > +	bool s_journal_destorying;
> >> > +
> >> >  	/* Atomic write unit values in bytes */
> >> >  	unsigned int s_awu_min;
> >> >  	unsigned int s_awu_max;
> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> >> > --- a/fs/ext4/ext4_jbd2.h
> >> > +++ b/fs/ext4/ext4_jbd2.h
> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >> >  {
> >> >  	int err = 0;
> >> >  
> >> > +	/*
> >> > +	 * At this point all pending FS updates should be done except a possible
> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >> > +	 * is now safe for any new errors to directly commit superblock rather
> >> > +	 * than going via journal.
> >> > +	 */
> >> > +	sbi->s_journal_destorying = true;
> >> 
> >> This is not correct right. I think what we decided to set this flag
> >> before we flush the workqueue. So that we don't schedule any new
> >> work after this flag has been set. At least that is what I understood.
> >> 
> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> >> 
> >> -ritesh
> >
> > Hey Ritesh,
> >
> > Yes that is not correct, I missed that in my patch however we realised
> > that adding it before flush_work() also has issues [1]. More
> > specifically:
> 
> Ohk. right. 
> 
> >
> >                      **kjournald2**
> >                      jbd2_journal_commit_transaction()
> >                      ...
> >                      ext4_handle_error()
> >                         /* s_journal_destorying is not set */
> >                         if (journal && !s_journal_destorying)
> 
> Then maybe we should not schedule another work to update the superblock
> via journalling, it the error itself occurred while were trying to
> commit the journal txn? 
> 
> 
> -ritesh

Hmm, ideally yes that should not happen, but how can we achieve that?
For example with the trace we saw:

   **kjournald2**
   jbd2_journal_commit_transaction()
     jbd2_journal_get_descriptor_buffer
       jbd2_journal_bmap
         ext4_journal_bmap
           ext4_map_blocks
             ...
             ext4_inode_error
               ext4_handle_error
                 schedule_work(&sbi->s_sb_upd_work)

How do we tell ext4_handle_error that it is in the context of a
committing txn. We can't pass down an argument all the way down 
cause that is not feasible. An sb level flag will also not work
I think. Any thoughts on this?

regards,
ojaswin

> 
> 
> >   ext4_put_super()
> >     sbi->s_journal_destorying = true;
> >     flush_work(&sbi->s_sb_upd_work)
> >                                       schedule_work()
> >     jbd2_journal_destroy()
> >      journal->j_flags |= JBD2_UNMOUNT;
> >
> >                                         jbd2_journal_start()
> >                                          start_this_handle()
> >                                            BUG_ON(JBD2_UNMOUNT)
> >
> > So the right thing to do seems to be that we need to force a journal
> > commit before the final flush as well. [1] Has more info on this and
> > some followup discussion as well.
> >
> > [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e
> >
> > Regards,
> > ojaswin
> >> 
> >> 
> >> > +
> >> >  	err = jbd2_journal_destroy(journal);
> >> >  	sbi->s_journal = NULL;
> >> >  
> >> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> > index 8ad664d47806..31552cf0519a 100644
> >> > --- a/fs/ext4/super.c
> >> > +++ b/fs/ext4/super.c
> >> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >> >  		 * constraints, it may not be safe to do it right here so we
> >> >  		 * defer superblock flushing to a workqueue.
> >> >  		 */
> >> > -		if (continue_fs && journal)
> >> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >> >  		else
> >> >  			ext4_commit_super(sb);
> >> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >> >  	spin_lock_init(&sbi->s_error_lock);
> >> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >> >  
> >> > +	sbi->s_journal_destorying = false;
> >> > +
> >> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >> >  	if (err)
> >> >  		goto failed_mount3;
> >> > -- 
> >> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
  2025-03-08 13:07   ` Ojaswin Mujoo
@ 2025-03-08 15:21     ` Ojaswin Mujoo
  2025-03-10  3:10       ` Baokun Li
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-08 15:21 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On Sat, Mar 08, 2025 at 06:37:41PM +0530, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 09:09:28AM +0800, Baokun Li wrote:
> > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > ** Changes since v1 [1] **
> > > 
> > >   * Picked up RVBs from Jan and Ritesh
> > >   * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
> > >     to determine when to journal sb vs when to commit directly.
> > >   * Added a prep patch 1/3
> > > 
> > > [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
> > > 
> > > @Baokun, I didn't get a chance to look into the journal_inode
> > > modifications we were discussing in [2]. I'll try to spend some time and
> > > send that as a separate patch. Hope that's okay
> > > 
> > > [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
> > That's fine, it's not a priority. And if this patch set makes sure we
> > don't crash when things go wrong, I'm okay with leaving it as is.
> > 
> > It's possible that jbd2_journal_commit_transaction() could call
> > ext4_handle_error() in other places as the code evolves. Fixing known
> > problems and protecting against potential ones is always a good thing.
> 
> Yep thats true, I did spend some time on this since the codepath was a
> bit unfamiliar to me. Seems like a straighforward enough change. I'll
> add it to the next patch.
> 
> thanks,
> ojaswin

Hey Baokun,

So while coding this up, I started looking at some codepaths and it got
me wondering when can we actually change the sbi->s_es->s_journal_inum
(or sbi->s_sbh) from the time it gets populated to the umount time?

Since the sbi->s_sbh buffer head is always in memory and never reclaimed
due the elevated reference, the only way to modify it should be if we
modify the memory page somehow. Or is there some codepath/tooling magic
I'm missing that can modify this value?

Regards,
ojaswin
> > 
> > 
> > Cheers,
> > Baokun
> > > ** Original Cover **
> > > 
> > > When running LTP stress tests on ext4, after a multiday run we seemed to
> > > have hit the following BUG_ON:
> > > 
> > >   [NIP  : start_this_handle+268]
> > >   #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
> > >   #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
> > >   #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
> > >   #6 [c000001067c27c70] process_one_work at c000000000188790
> > >   #7 [c000001067c27d20] worker_thread at c00000000018973c
> > >   #8 [c000001067c27dc0] kthread at c000000000196c84
> > >   #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
> > > 
> > > Which comes out to
> > > 
> > >    382   repeat:
> > >    383           read_lock(&journal->j_state_lock);
> > > * 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
> > >    385           if (is_journal_aborted(journal) ||
> > >    386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
> > >    387                   read_unlock(&journal->j_state_lock);
> > > 
> > > 
> > > Initially this seemed like it should never happen but upon crash
> > > analysis it seems like it could indeed be hit as described in patch 1/2.
> > > 
> > > I would like to add that through the logs we only knew that:
> > > 
> > > - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
> > > - update_super_work had hit the BUG_ON
> > > 
> > > I was not able to hit this bug again (without modifying the kernel to
> > > inject errors) but the above backtrace seems to be one possible paths
> > > where this BUG_ON can be hit. Rest of the analysis and fix is in patch
> > > 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
> > > 
> > > That being said, journalling is something I'm not very familiar with and
> > > there might be gaps in my understanding so thoughts and suggestions are
> > > welcome.
> > > 
> > > Ojaswin Mujoo (3):
> > >    ext4: define ext4_journal_destroy wrapper
> > >    ext4: avoid journaling sb update on error if journal is destroying
> > >    ext4: Make sb update interval tunable
> > > 
> > >   fs/ext4/ext4.h      | 11 +++++++++++
> > >   fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
> > >   fs/ext4/super.c     | 35 +++++++++++++++++------------------
> > >   fs/ext4/sysfs.c     |  4 ++++
> > >   4 files changed, 54 insertions(+), 18 deletions(-)
> > > 
> > 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08 14:58         ` Ojaswin Mujoo
@ 2025-03-08 18:41           ` Ritesh Harjani
  2025-03-09 12:07             ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Ritesh Harjani @ 2025-03-08 18:41 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> 
>> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
>> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> >> 
>> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
>> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>> >> > stress tests, it was observed that in case of some error handling paths, it is
>> >> > possible for update_super_work to start a transaction after the journal is
>> >> > destroyed eg:
>> >> >
>> >> > (umount)
>> >> > ext4_kill_sb
>> >> >   kill_block_super
>> >> >     generic_shutdown_super
>> >> >       sync_filesystem /* commits all txns */
>> >> >       evict_inodes
>> >> >         /* might start a new txn */
>> >> >       ext4_put_super
>> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>> >> >         jbd2_journal_destroy
>> >> >           journal_kill_thread
>> >> >             journal->j_flags |= JBD2_UNMOUNT;
>> >> >           jbd2_journal_commit_transaction
>> >> >             jbd2_journal_get_descriptor_buffer
>> >> >               jbd2_journal_bmap
>> >> >                 ext4_journal_bmap
>> >> >                   ext4_map_blocks
>> >> >                     ...
>> >> >                     ext4_inode_error
>> >> >                       ext4_handle_error
>> >> >                         schedule_work(&sbi->s_sb_upd_work)
>> >> >
>> >> >                                                /* work queue kicks in */
>> >> >                                                update_super_work
>> >> >                                                  jbd2_journal_start
>> >> >                                                    start_this_handle
>> >> >                                                      BUG_ON(journal->j_flags &
>> >> >                                                             JBD2_UNMOUNT)
>> >> >
>> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
>> >> > set. Otherwise, just fallback to an un-journaled commit.
>> >> >
>> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
>> >> > during ext4_put_super() (except a running transaction that will get commited
>> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>> >> > outside the journal as it won't race with a journaled update (refer
>> >> > 2d01ddc86606).
>> >> >
>> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
>> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
>> >> >
>> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>> >> > Suggested-by: Jan Kara <jack@suse.cz>
>> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> >> > ---
>> >> >  fs/ext4/ext4.h      | 2 ++
>> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
>> >> >  fs/ext4/super.c     | 4 +++-
>> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> >> > index 2b7d781bfcad..d48e93bd5690 100644
>> >> > --- a/fs/ext4/ext4.h
>> >> > +++ b/fs/ext4/ext4.h
>> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>> >> >  	 */
>> >> >  	struct work_struct s_sb_upd_work;
>> >> >  
>> >> > +	bool s_journal_destorying;
>> >> > +
>> >> >  	/* Atomic write unit values in bytes */
>> >> >  	unsigned int s_awu_min;
>> >> >  	unsigned int s_awu_max;
>> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> >> > index 9b3c9df02a39..6bd3ca84410d 100644
>> >> > --- a/fs/ext4/ext4_jbd2.h
>> >> > +++ b/fs/ext4/ext4_jbd2.h
>> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>> >> >  {
>> >> >  	int err = 0;
>> >> >  
>> >> > +	/*
>> >> > +	 * At this point all pending FS updates should be done except a possible
>> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
>> >> > +	 * is now safe for any new errors to directly commit superblock rather
>> >> > +	 * than going via journal.
>> >> > +	 */
>> >> > +	sbi->s_journal_destorying = true;
>> >> 
>> >> This is not correct right. I think what we decided to set this flag
>> >> before we flush the workqueue. So that we don't schedule any new
>> >> work after this flag has been set. At least that is what I understood.
>> >> 
>> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
>> >> 
>> >> -ritesh
>> >
>> > Hey Ritesh,
>> >
>> > Yes that is not correct, I missed that in my patch however we realised
>> > that adding it before flush_work() also has issues [1]. More
>> > specifically:
>> 
>> Ohk. right. 
>> 
>> >
>> >                      **kjournald2**
>> >                      jbd2_journal_commit_transaction()
>> >                      ...
>> >                      ext4_handle_error()
>> >                         /* s_journal_destorying is not set */
>> >                         if (journal && !s_journal_destorying)
>> 
>> Then maybe we should not schedule another work to update the superblock
>> via journalling, it the error itself occurred while were trying to
>> commit the journal txn? 
>> 
>> 
>> -ritesh
>
> Hmm, ideally yes that should not happen, but how can we achieve that?
> For example with the trace we saw:
>
>    **kjournald2**
>    jbd2_journal_commit_transaction()
>      jbd2_journal_get_descriptor_buffer
>        jbd2_journal_bmap
>          ext4_journal_bmap
>            ext4_map_blocks
>              ...
>              ext4_inode_error
>                ext4_handle_error
>                  schedule_work(&sbi->s_sb_upd_work)
>
> How do we tell ext4_handle_error that it is in the context of a
> committing txn.

So even if we identify that the current
jbd2_journal_commit_transaction() is coming from kjournald2(), that is
sufficient right? Because the only other place where we call
jbd2_journal_commit_transaction() is jbd2_journal_destroy() and that
happens after we can set few things from ext4_put_super() and flush work
is completed, correct? 


> We can't pass down an argument all the way down 
> cause that is not feasible. An sb level flag will also not work
> I think. Any thoughts on this?

I was thinking if we should have a per task flag? Something like
PF_KJOURNALD?  (Similar to how we have PF_KSWAPD or PF_KCOMPACTD)? This
can help us identify if we are a kjournald2() kthread.

That will help prevent scheduling another work item to start a new
transaction in case an error occurs while committing the currently
running transaction. Correct?

Now I don't know if we have any free bit available in current->flags. If
not shall we use current->journal_info pointer to have 0th bit as a
flag? Basically override current->journal_info to also store a flag.  We
can create a wrapper to get the journal_info from current by masking
this flag bit and use it to dereference journal_info.

But before going down that road, it's better to know what others think?

-ritesh


>
> regards,
> ojaswin
>
>> 
>> 
>> >   ext4_put_super()
>> >     sbi->s_journal_destorying = true;
>> >     flush_work(&sbi->s_sb_upd_work)
>> >                                       schedule_work()
>> >     jbd2_journal_destroy()
>> >      journal->j_flags |= JBD2_UNMOUNT;
>> >
>> >                                         jbd2_journal_start()
>> >                                          start_this_handle()
>> >                                            BUG_ON(JBD2_UNMOUNT)
>> >
>> > So the right thing to do seems to be that we need to force a journal
>> > commit before the final flush as well. [1] Has more info on this and
>> > some followup discussion as well.
>> >
>> > [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e
>> >
>> > Regards,
>> > ojaswin
>> >> 
>> >> 
>> >> > +
>> >> >  	err = jbd2_journal_destroy(journal);
>> >> >  	sbi->s_journal = NULL;
>> >> >  
>> >> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> >> > index 8ad664d47806..31552cf0519a 100644
>> >> > --- a/fs/ext4/super.c
>> >> > +++ b/fs/ext4/super.c
>> >> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>> >> >  		 * constraints, it may not be safe to do it right here so we
>> >> >  		 * defer superblock flushing to a workqueue.
>> >> >  		 */
>> >> > -		if (continue_fs && journal)
>> >> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>> >> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>> >> >  		else
>> >> >  			ext4_commit_super(sb);
>> >> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>> >> >  	spin_lock_init(&sbi->s_error_lock);
>> >> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>> >> >  
>> >> > +	sbi->s_journal_destorying = false;
>> >> > +
>> >> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>> >> >  	if (err)
>> >> >  		goto failed_mount3;
>> >> > -- 
>> >> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-08 18:41           ` Ritesh Harjani
@ 2025-03-09 12:07             ` Ojaswin Mujoo
  2025-03-10  4:43               ` Ritesh Harjani
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-09 12:07 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> 
> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> 
> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> >> 
> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >> >> > stress tests, it was observed that in case of some error handling paths, it is
> >> >> > possible for update_super_work to start a transaction after the journal is
> >> >> > destroyed eg:
> >> >> >
> >> >> > (umount)
> >> >> > ext4_kill_sb
> >> >> >   kill_block_super
> >> >> >     generic_shutdown_super
> >> >> >       sync_filesystem /* commits all txns */
> >> >> >       evict_inodes
> >> >> >         /* might start a new txn */
> >> >> >       ext4_put_super
> >> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >> >> >         jbd2_journal_destroy
> >> >> >           journal_kill_thread
> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
> >> >> >           jbd2_journal_commit_transaction
> >> >> >             jbd2_journal_get_descriptor_buffer
> >> >> >               jbd2_journal_bmap
> >> >> >                 ext4_journal_bmap
> >> >> >                   ext4_map_blocks
> >> >> >                     ...
> >> >> >                     ext4_inode_error
> >> >> >                       ext4_handle_error
> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
> >> >> >
> >> >> >                                                /* work queue kicks in */
> >> >> >                                                update_super_work
> >> >> >                                                  jbd2_journal_start
> >> >> >                                                    start_this_handle
> >> >> >                                                      BUG_ON(journal->j_flags &
> >> >> >                                                             JBD2_UNMOUNT)
> >> >> >
> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
> >> >> > set. Otherwise, just fallback to an un-journaled commit.
> >> >> >
> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> >> >> > during ext4_put_super() (except a running transaction that will get commited
> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >> >> > outside the journal as it won't race with a journaled update (refer
> >> >> > 2d01ddc86606).
> >> >> >
> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
> >> >> >
> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >> >> > Suggested-by: Jan Kara <jack@suse.cz>
> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> >> > ---
> >> >> >  fs/ext4/ext4.h      | 2 ++
> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >> >> >  fs/ext4/super.c     | 4 +++-
> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >> >> >
> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
> >> >> > --- a/fs/ext4/ext4.h
> >> >> > +++ b/fs/ext4/ext4.h
> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >> >> >  	 */
> >> >> >  	struct work_struct s_sb_upd_work;
> >> >> >  
> >> >> > +	bool s_journal_destorying;
> >> >> > +
> >> >> >  	/* Atomic write unit values in bytes */
> >> >> >  	unsigned int s_awu_min;
> >> >> >  	unsigned int s_awu_max;
> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> >> >> > --- a/fs/ext4/ext4_jbd2.h
> >> >> > +++ b/fs/ext4/ext4_jbd2.h
> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >> >> >  {
> >> >> >  	int err = 0;
> >> >> >  
> >> >> > +	/*
> >> >> > +	 * At this point all pending FS updates should be done except a possible
> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
> >> >> > +	 * than going via journal.
> >> >> > +	 */
> >> >> > +	sbi->s_journal_destorying = true;
> >> >> 
> >> >> This is not correct right. I think what we decided to set this flag
> >> >> before we flush the workqueue. So that we don't schedule any new
> >> >> work after this flag has been set. At least that is what I understood.
> >> >> 
> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> >> >> 
> >> >> -ritesh
> >> >
> >> > Hey Ritesh,
> >> >
> >> > Yes that is not correct, I missed that in my patch however we realised
> >> > that adding it before flush_work() also has issues [1]. More
> >> > specifically:
> >> 
> >> Ohk. right. 
> >> 
> >> >
> >> >                      **kjournald2**
> >> >                      jbd2_journal_commit_transaction()
> >> >                      ...
> >> >                      ext4_handle_error()
> >> >                         /* s_journal_destorying is not set */
> >> >                         if (journal && !s_journal_destorying)
> >> 
> >> Then maybe we should not schedule another work to update the superblock
> >> via journalling, it the error itself occurred while were trying to
> >> commit the journal txn? 
> >> 
> >> 
> >> -ritesh
> >
> > Hmm, ideally yes that should not happen, but how can we achieve that?
> > For example with the trace we saw:
> >
> >    **kjournald2**
> >    jbd2_journal_commit_transaction()
> >      jbd2_journal_get_descriptor_buffer
> >        jbd2_journal_bmap
> >          ext4_journal_bmap
> >            ext4_map_blocks
> >              ...
> >              ext4_inode_error
> >                ext4_handle_error
> >                  schedule_work(&sbi->s_sb_upd_work)
> >
> > How do we tell ext4_handle_error that it is in the context of a
> > committing txn.
> 
> So even if we identify that the current
> jbd2_journal_commit_transaction() is coming from kjournald2(), that is
> sufficient right? Because the only other place where we call
> jbd2_journal_commit_transaction() is jbd2_journal_destroy() and that
> happens after we can set few things from ext4_put_super() and flush work
> is completed, correct? 
> 
> 
> > We can't pass down an argument all the way down 
> > cause that is not feasible. An sb level flag will also not work
> > I think. Any thoughts on this?
> 
> I was thinking if we should have a per task flag? Something like
> PF_KJOURNALD?  (Similar to how we have PF_KSWAPD or PF_KCOMPACTD)? This
> can help us identify if we are a kjournald2() kthread.
> 
> That will help prevent scheduling another work item to start a new
> transaction in case an error occurs while committing the currently
> running transaction. Correct?

Yes, I like this approach. I think this will also help us avoid the
extra checks in ext4_journal_destroy() since the journal will no longer
schecule work for updating the sb. Hence we can be sure that after the
final flush_work() noone will try to schedule more work or start a new
transaction.

I'll try to spin up a poc and test it. Does seem like we are out of flags in 
task struct though.

Regards,
ojaswin

> 
> Now I don't know if we have any free bit available in current->flags. If
> not shall we use current->journal_info pointer to have 0th bit as a
> flag? Basically override current->journal_info to also store a flag.  We
> can create a wrapper to get the journal_info from current by masking
> this flag bit and use it to dereference journal_info.

Hmm so journal_info will be holding a kernel address of the handle. Is
it possible to have it share a flag as well? I thought the address would
utilize the full 64bits?

Regards,
ojaswin

> 
> But before going down that road, it's better to know what others think?
> 
> -ritesh
> 
> 
> >
> > regards,
> > ojaswin
> >
> >> 
> >> 
> >> >   ext4_put_super()
> >> >     sbi->s_journal_destorying = true;
> >> >     flush_work(&sbi->s_sb_upd_work)
> >> >                                       schedule_work()
> >> >     jbd2_journal_destroy()
> >> >      journal->j_flags |= JBD2_UNMOUNT;
> >> >
> >> >                                         jbd2_journal_start()
> >> >                                          start_this_handle()
> >> >                                            BUG_ON(JBD2_UNMOUNT)
> >> >
> >> > So the right thing to do seems to be that we need to force a journal
> >> > commit before the final flush as well. [1] Has more info on this and
> >> > some followup discussion as well.
> >> >
> >> > [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e
> >> >
> >> > Regards,
> >> > ojaswin
> >> >> 
> >> >> 
> >> >> > +
> >> >> >  	err = jbd2_journal_destroy(journal);
> >> >> >  	sbi->s_journal = NULL;
> >> >> >  
> >> >> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> >> > index 8ad664d47806..31552cf0519a 100644
> >> >> > --- a/fs/ext4/super.c
> >> >> > +++ b/fs/ext4/super.c
> >> >> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >> >> >  		 * constraints, it may not be safe to do it right here so we
> >> >> >  		 * defer superblock flushing to a workqueue.
> >> >> >  		 */
> >> >> > -		if (continue_fs && journal)
> >> >> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
> >> >> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >> >> >  		else
> >> >> >  			ext4_commit_super(sb);
> >> >> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >> >> >  	spin_lock_init(&sbi->s_error_lock);
> >> >> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
> >> >> >  
> >> >> > +	sbi->s_journal_destorying = false;
> >> >> > +
> >> >> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
> >> >> >  	if (err)
> >> >> >  		goto failed_mount3;
> >> >> > -- 
> >> >> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
  2025-03-08 15:21     ` Ojaswin Mujoo
@ 2025-03-10  3:10       ` Baokun Li
  2025-03-10  7:59         ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Baokun Li @ 2025-03-10  3:10 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On 2025/3/8 23:21, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 06:37:41PM +0530, Ojaswin Mujoo wrote:
>> On Sat, Mar 08, 2025 at 09:09:28AM +0800, Baokun Li wrote:
>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>> ** Changes since v1 [1] **
>>>>
>>>>    * Picked up RVBs from Jan and Ritesh
>>>>    * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
>>>>      to determine when to journal sb vs when to commit directly.
>>>>    * Added a prep patch 1/3
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
>>>>
>>>> @Baokun, I didn't get a chance to look into the journal_inode
>>>> modifications we were discussing in [2]. I'll try to spend some time and
>>>> send that as a separate patch. Hope that's okay
>>>>
>>>> [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
>>> That's fine, it's not a priority. And if this patch set makes sure we
>>> don't crash when things go wrong, I'm okay with leaving it as is.
>>>
>>> It's possible that jbd2_journal_commit_transaction() could call
>>> ext4_handle_error() in other places as the code evolves. Fixing known
>>> problems and protecting against potential ones is always a good thing.
>> Yep thats true, I did spend some time on this since the codepath was a
>> bit unfamiliar to me. Seems like a straighforward enough change. I'll
>> add it to the next patch.
>>
>> thanks,
>> ojaswin
> Hey Baokun,
>
> So while coding this up, I started looking at some codepaths and it got
> me wondering when can we actually change the sbi->s_es->s_journal_inum
> (or sbi->s_sbh) from the time it gets populated to the umount time?
>
> Since the sbi->s_sbh buffer head is always in memory and never reclaimed
> due the elevated reference, the only way to modify it should be if we
> modify the memory page somehow. Or is there some codepath/tooling magic
> I'm missing that can modify this value?
For the ext4 file system, s_journal_inum is fixed after mkfs and is not
expected to be modified. However, the sbi->s_sbh buffer head belongs to
the block device, so direct write operations to the file system's block
device /dev/xxx may directly modify sbi->s_sbh.

For example, you can use the following script to trigger a journal bmap
failure:

mkfs.ext4 -F /dev/sda
mount /dev/sda /tmp/test
dd if=/dev/zero of=/dev/sda bs=4096 count=1
echo 1 > /tmp/test/file
umount /tmp/test

(Adding a delay in put_super and performing the script during the delay
  can trigger the issue described in the patch.)


Cheers,
Baokun
>
> Regards,
> ojaswin
>>>
>>> Cheers,
>>> Baokun
>>>> ** Original Cover **
>>>>
>>>> When running LTP stress tests on ext4, after a multiday run we seemed to
>>>> have hit the following BUG_ON:
>>>>
>>>>    [NIP  : start_this_handle+268]
>>>>    #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
>>>>    #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
>>>>    #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
>>>>    #6 [c000001067c27c70] process_one_work at c000000000188790
>>>>    #7 [c000001067c27d20] worker_thread at c00000000018973c
>>>>    #8 [c000001067c27dc0] kthread at c000000000196c84
>>>>    #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
>>>>
>>>> Which comes out to
>>>>
>>>>     382   repeat:
>>>>     383           read_lock(&journal->j_state_lock);
>>>> * 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
>>>>     385           if (is_journal_aborted(journal) ||
>>>>     386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
>>>>     387                   read_unlock(&journal->j_state_lock);
>>>>
>>>>
>>>> Initially this seemed like it should never happen but upon crash
>>>> analysis it seems like it could indeed be hit as described in patch 1/2.
>>>>
>>>> I would like to add that through the logs we only knew that:
>>>>
>>>> - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
>>>> - update_super_work had hit the BUG_ON
>>>>
>>>> I was not able to hit this bug again (without modifying the kernel to
>>>> inject errors) but the above backtrace seems to be one possible paths
>>>> where this BUG_ON can be hit. Rest of the analysis and fix is in patch
>>>> 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
>>>>
>>>> That being said, journalling is something I'm not very familiar with and
>>>> there might be gaps in my understanding so thoughts and suggestions are
>>>> welcome.
>>>>
>>>> Ojaswin Mujoo (3):
>>>>     ext4: define ext4_journal_destroy wrapper
>>>>     ext4: avoid journaling sb update on error if journal is destroying
>>>>     ext4: Make sb update interval tunable
>>>>
>>>>    fs/ext4/ext4.h      | 11 +++++++++++
>>>>    fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
>>>>    fs/ext4/super.c     | 35 +++++++++++++++++------------------
>>>>    fs/ext4/sysfs.c     |  4 ++++
>>>>    4 files changed, 54 insertions(+), 18 deletions(-)
>>>>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-09 12:07             ` Ojaswin Mujoo
@ 2025-03-10  4:43               ` Ritesh Harjani
  2025-03-12 10:51                 ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Ritesh Harjani @ 2025-03-10  4:43 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> 
>> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
>> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> >> 
>> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
>> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>> >> >> 
>> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
>> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>> >> >> > stress tests, it was observed that in case of some error handling paths, it is
>> >> >> > possible for update_super_work to start a transaction after the journal is
>> >> >> > destroyed eg:
>> >> >> >
>> >> >> > (umount)
>> >> >> > ext4_kill_sb
>> >> >> >   kill_block_super
>> >> >> >     generic_shutdown_super
>> >> >> >       sync_filesystem /* commits all txns */
>> >> >> >       evict_inodes
>> >> >> >         /* might start a new txn */
>> >> >> >       ext4_put_super
>> >> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>> >> >> >         jbd2_journal_destroy
>> >> >> >           journal_kill_thread
>> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
>> >> >> >           jbd2_journal_commit_transaction
>> >> >> >             jbd2_journal_get_descriptor_buffer
>> >> >> >               jbd2_journal_bmap
>> >> >> >                 ext4_journal_bmap
>> >> >> >                   ext4_map_blocks
>> >> >> >                     ...
>> >> >> >                     ext4_inode_error
>> >> >> >                       ext4_handle_error
>> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
>> >> >> >
>> >> >> >                                                /* work queue kicks in */
>> >> >> >                                                update_super_work
>> >> >> >                                                  jbd2_journal_start
>> >> >> >                                                    start_this_handle
>> >> >> >                                                      BUG_ON(journal->j_flags &
>> >> >> >                                                             JBD2_UNMOUNT)
>> >> >> >
>> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>> >> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
>> >> >> > set. Otherwise, just fallback to an un-journaled commit.
>> >> >> >
>> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
>> >> >> > during ext4_put_super() (except a running transaction that will get commited
>> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>> >> >> > outside the journal as it won't race with a journaled update (refer
>> >> >> > 2d01ddc86606).
>> >> >> >
>> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
>> >> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
>> >> >> >
>> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>> >> >> > Suggested-by: Jan Kara <jack@suse.cz>
>> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> >> >> > ---
>> >> >> >  fs/ext4/ext4.h      | 2 ++
>> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
>> >> >> >  fs/ext4/super.c     | 4 +++-
>> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
>> >> >> >
>> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
>> >> >> > --- a/fs/ext4/ext4.h
>> >> >> > +++ b/fs/ext4/ext4.h
>> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>> >> >> >  	 */
>> >> >> >  	struct work_struct s_sb_upd_work;
>> >> >> >  
>> >> >> > +	bool s_journal_destorying;
>> >> >> > +
>> >> >> >  	/* Atomic write unit values in bytes */
>> >> >> >  	unsigned int s_awu_min;
>> >> >> >  	unsigned int s_awu_max;
>> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
>> >> >> > --- a/fs/ext4/ext4_jbd2.h
>> >> >> > +++ b/fs/ext4/ext4_jbd2.h
>> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>> >> >> >  {
>> >> >> >  	int err = 0;
>> >> >> >  
>> >> >> > +	/*
>> >> >> > +	 * At this point all pending FS updates should be done except a possible
>> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
>> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
>> >> >> > +	 * than going via journal.
>> >> >> > +	 */
>> >> >> > +	sbi->s_journal_destorying = true;
>> >> >> 
>> >> >> This is not correct right. I think what we decided to set this flag
>> >> >> before we flush the workqueue. So that we don't schedule any new
>> >> >> work after this flag has been set. At least that is what I understood.
>> >> >> 
>> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
>> >> >> 
>> >> >> -ritesh
>> >> >
>> >> > Hey Ritesh,
>> >> >
>> >> > Yes that is not correct, I missed that in my patch however we realised
>> >> > that adding it before flush_work() also has issues [1]. More
>> >> > specifically:
>> >> 
>> >> Ohk. right. 
>> >> 
>> >> >
>> >> >                      **kjournald2**
>> >> >                      jbd2_journal_commit_transaction()
>> >> >                      ...
>> >> >                      ext4_handle_error()
>> >> >                         /* s_journal_destorying is not set */
>> >> >                         if (journal && !s_journal_destorying)
>> >> 
>> >> Then maybe we should not schedule another work to update the superblock
>> >> via journalling, it the error itself occurred while were trying to
>> >> commit the journal txn? 
>> >> 
>> >> 
>> >> -ritesh
>> >
>> > Hmm, ideally yes that should not happen, but how can we achieve that?
>> > For example with the trace we saw:
>> >
>> >    **kjournald2**
>> >    jbd2_journal_commit_transaction()
>> >      jbd2_journal_get_descriptor_buffer
>> >        jbd2_journal_bmap
>> >          ext4_journal_bmap
>> >            ext4_map_blocks
>> >              ...
>> >              ext4_inode_error
>> >                ext4_handle_error
>> >                  schedule_work(&sbi->s_sb_upd_work)
>> >
>> > How do we tell ext4_handle_error that it is in the context of a
>> > committing txn.
>> 
>> So even if we identify that the current
>> jbd2_journal_commit_transaction() is coming from kjournald2(), that is
>> sufficient right? Because the only other place where we call
>> jbd2_journal_commit_transaction() is jbd2_journal_destroy() and that
>> happens after we can set few things from ext4_put_super() and flush work
>> is completed, correct? 
>> 
>> 
>> > We can't pass down an argument all the way down 
>> > cause that is not feasible. An sb level flag will also not work
>> > I think. Any thoughts on this?
>> 
>> I was thinking if we should have a per task flag? Something like
>> PF_KJOURNALD?  (Similar to how we have PF_KSWAPD or PF_KCOMPACTD)? This
>> can help us identify if we are a kjournald2() kthread.
>> 
>> That will help prevent scheduling another work item to start a new
>> transaction in case an error occurs while committing the currently
>> running transaction. Correct?
>
> Yes, I like this approach. I think this will also help us avoid the
> extra checks in ext4_journal_destroy() since the journal will no longer
> schecule work for updating the sb. Hence we can be sure that after the
> final flush_work() noone will try to schedule more work or start a new
> transaction.
>
> I'll try to spin up a poc and test it. Does seem like we are out of flags in 
> task struct though.
>
> Regards,
> ojaswin
>
>> 
>> Now I don't know if we have any free bit available in current->flags. If
>> not shall we use current->journal_info pointer to have 0th bit as a
>> flag? Basically override current->journal_info to also store a flag.  We
>> can create a wrapper to get the journal_info from current by masking
>> this flag bit and use it to dereference journal_info.
>
> Hmm so journal_info will be holding a kernel address of the handle. Is
> it possible to have it share a flag as well? I thought the address would
> utilize the full 64bits?


What I meant here was - 
In general I assume we should be able to play some tricks with a pointer
which at least should be 4 byte aligned. So we can save some flag bits in
the lower bits of a pointer.  For e.g. check struct address_space &
PAGE_MAPPING_FLAGS. There maybe few other examples too.


-ritesh


>
> Regards,
> ojaswin
>
>> 
>> But before going down that road, it's better to know what others think?
>> 
>> -ritesh
>> 
>> 
>> >
>> > regards,
>> > ojaswin
>> >
>> >> 
>> >> 
>> >> >   ext4_put_super()
>> >> >     sbi->s_journal_destorying = true;
>> >> >     flush_work(&sbi->s_sb_upd_work)
>> >> >                                       schedule_work()
>> >> >     jbd2_journal_destroy()
>> >> >      journal->j_flags |= JBD2_UNMOUNT;
>> >> >
>> >> >                                         jbd2_journal_start()
>> >> >                                          start_this_handle()
>> >> >                                            BUG_ON(JBD2_UNMOUNT)
>> >> >
>> >> > So the right thing to do seems to be that we need to force a journal
>> >> > commit before the final flush as well. [1] Has more info on this and
>> >> > some followup discussion as well.
>> >> >
>> >> > [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@linux.ibm.com/T/#mc8046d47b357665bdbd2878c91e51eb660f94b3e
>> >> >
>> >> > Regards,
>> >> > ojaswin
>> >> >> 
>> >> >> 
>> >> >> > +
>> >> >> >  	err = jbd2_journal_destroy(journal);
>> >> >> >  	sbi->s_journal = NULL;
>> >> >> >  
>> >> >> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> >> >> > index 8ad664d47806..31552cf0519a 100644
>> >> >> > --- a/fs/ext4/super.c
>> >> >> > +++ b/fs/ext4/super.c
>> >> >> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>> >> >> >  		 * constraints, it may not be safe to do it right here so we
>> >> >> >  		 * defer superblock flushing to a workqueue.
>> >> >> >  		 */
>> >> >> > -		if (continue_fs && journal)
>> >> >> > +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>> >> >> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>> >> >> >  		else
>> >> >> >  			ext4_commit_super(sb);
>> >> >> > @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>> >> >> >  	spin_lock_init(&sbi->s_error_lock);
>> >> >> >  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>> >> >> >  
>> >> >> > +	sbi->s_journal_destorying = false;
>> >> >> > +
>> >> >> >  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>> >> >> >  	if (err)
>> >> >> >  		goto failed_mount3;
>> >> >> > -- 
>> >> >> > 2.48.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
  2025-03-10  3:10       ` Baokun Li
@ 2025-03-10  7:59         ` Ojaswin Mujoo
  0 siblings, 0 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-10  7:59 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel, Yang Erkun

On Mon, Mar 10, 2025 at 11:10:26AM +0800, Baokun Li wrote:
> On 2025/3/8 23:21, Ojaswin Mujoo wrote:
> > On Sat, Mar 08, 2025 at 06:37:41PM +0530, Ojaswin Mujoo wrote:
> > > On Sat, Mar 08, 2025 at 09:09:28AM +0800, Baokun Li wrote:
> > > > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > > > ** Changes since v1 [1] **
> > > > > 
> > > > >    * Picked up RVBs from Jan and Ritesh
> > > > >    * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
> > > > >      to determine when to journal sb vs when to commit directly.
> > > > >    * Added a prep patch 1/3
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
> > > > > 
> > > > > @Baokun, I didn't get a chance to look into the journal_inode
> > > > > modifications we were discussing in [2]. I'll try to spend some time and
> > > > > send that as a separate patch. Hope that's okay
> > > > > 
> > > > > [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@linux.ibm.com/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
> > > > That's fine, it's not a priority. And if this patch set makes sure we
> > > > don't crash when things go wrong, I'm okay with leaving it as is.
> > > > 
> > > > It's possible that jbd2_journal_commit_transaction() could call
> > > > ext4_handle_error() in other places as the code evolves. Fixing known
> > > > problems and protecting against potential ones is always a good thing.
> > > Yep thats true, I did spend some time on this since the codepath was a
> > > bit unfamiliar to me. Seems like a straighforward enough change. I'll
> > > add it to the next patch.
> > > 
> > > thanks,
> > > ojaswin
> > Hey Baokun,
> > 
> > So while coding this up, I started looking at some codepaths and it got
> > me wondering when can we actually change the sbi->s_es->s_journal_inum
> > (or sbi->s_sbh) from the time it gets populated to the umount time?
> > 
> > Since the sbi->s_sbh buffer head is always in memory and never reclaimed
> > due the elevated reference, the only way to modify it should be if we
> > modify the memory page somehow. Or is there some codepath/tooling magic
> > I'm missing that can modify this value?
> For the ext4 file system, s_journal_inum is fixed after mkfs and is not
> expected to be modified. However, the sbi->s_sbh buffer head belongs to
> the block device, so direct write operations to the file system's block
> device /dev/xxx may directly modify sbi->s_sbh.

Right got it, now I get the full picture of the change you are
suggesting. Thanks Baokun, I'll do that in next revision.

Regards,
ojaswin

> 
> For example, you can use the following script to trigger a journal bmap
> failure:
> 
> mkfs.ext4 -F /dev/sda
> mount /dev/sda /tmp/test
> dd if=/dev/zero of=/dev/sda bs=4096 count=1
> echo 1 > /tmp/test/file
> umount /tmp/test
> 
> (Adding a delay in put_super and performing the script during the delay
>  can trigger the issue described in the patch.)
> 
> 
> Cheers,
> Baokun
> > 
> > Regards,
> > ojaswin
> > > > 
> > > > Cheers,
> > > > Baokun
> > > > > ** Original Cover **
> > > > > 
> > > > > When running LTP stress tests on ext4, after a multiday run we seemed to
> > > > > have hit the following BUG_ON:
> > > > > 
> > > > >    [NIP  : start_this_handle+268]
> > > > >    #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
> > > > >    #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
> > > > >    #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
> > > > >    #6 [c000001067c27c70] process_one_work at c000000000188790
> > > > >    #7 [c000001067c27d20] worker_thread at c00000000018973c
> > > > >    #8 [c000001067c27dc0] kthread at c000000000196c84
> > > > >    #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
> > > > > 
> > > > > Which comes out to
> > > > > 
> > > > >     382   repeat:
> > > > >     383           read_lock(&journal->j_state_lock);
> > > > > * 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
> > > > >     385           if (is_journal_aborted(journal) ||
> > > > >     386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
> > > > >     387                   read_unlock(&journal->j_state_lock);
> > > > > 
> > > > > 
> > > > > Initially this seemed like it should never happen but upon crash
> > > > > analysis it seems like it could indeed be hit as described in patch 1/2.
> > > > > 
> > > > > I would like to add that through the logs we only knew that:
> > > > > 
> > > > > - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
> > > > > - update_super_work had hit the BUG_ON
> > > > > 
> > > > > I was not able to hit this bug again (without modifying the kernel to
> > > > > inject errors) but the above backtrace seems to be one possible paths
> > > > > where this BUG_ON can be hit. Rest of the analysis and fix is in patch
> > > > > 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
> > > > > 
> > > > > That being said, journalling is something I'm not very familiar with and
> > > > > there might be gaps in my understanding so thoughts and suggestions are
> > > > > welcome.
> > > > > 
> > > > > Ojaswin Mujoo (3):
> > > > >     ext4: define ext4_journal_destroy wrapper
> > > > >     ext4: avoid journaling sb update on error if journal is destroying
> > > > >     ext4: Make sb update interval tunable
> > > > > 
> > > > >    fs/ext4/ext4.h      | 11 +++++++++++
> > > > >    fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
> > > > >    fs/ext4/super.c     | 35 +++++++++++++++++------------------
> > > > >    fs/ext4/sysfs.c     |  4 ++++
> > > > >    4 files changed, 54 insertions(+), 18 deletions(-)
> > > > > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-10  4:43               ` Ritesh Harjani
@ 2025-03-12 10:51                 ` Jan Kara
  2025-03-12 14:26                   ` Ojaswin Mujoo
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2025-03-12 10:51 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li,
	linux-kernel, Mahesh Kumar

On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> >> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> >> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> >> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> >> >> >> > stress tests, it was observed that in case of some error handling paths, it is
> >> >> >> > possible for update_super_work to start a transaction after the journal is
> >> >> >> > destroyed eg:
> >> >> >> >
> >> >> >> > (umount)
> >> >> >> > ext4_kill_sb
> >> >> >> >   kill_block_super
> >> >> >> >     generic_shutdown_super
> >> >> >> >       sync_filesystem /* commits all txns */
> >> >> >> >       evict_inodes
> >> >> >> >         /* might start a new txn */
> >> >> >> >       ext4_put_super
> >> >> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >> >> >> >         jbd2_journal_destroy
> >> >> >> >           journal_kill_thread
> >> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
> >> >> >> >           jbd2_journal_commit_transaction
> >> >> >> >             jbd2_journal_get_descriptor_buffer
> >> >> >> >               jbd2_journal_bmap
> >> >> >> >                 ext4_journal_bmap
> >> >> >> >                   ext4_map_blocks
> >> >> >> >                     ...
> >> >> >> >                     ext4_inode_error
> >> >> >> >                       ext4_handle_error
> >> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
> >> >> >> >
> >> >> >> >                                                /* work queue kicks in */
> >> >> >> >                                                update_super_work
> >> >> >> >                                                  jbd2_journal_start
> >> >> >> >                                                    start_this_handle
> >> >> >> >                                                      BUG_ON(journal->j_flags &
> >> >> >> >                                                             JBD2_UNMOUNT)
> >> >> >> >
> >> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> >> >> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
> >> >> >> > set. Otherwise, just fallback to an un-journaled commit.
> >> >> >> >
> >> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> >> >> >> > during ext4_put_super() (except a running transaction that will get commited
> >> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >> >> >> > outside the journal as it won't race with a journaled update (refer
> >> >> >> > 2d01ddc86606).
> >> >> >> >
> >> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >> >> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
> >> >> >> >
> >> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> >> >> >> > Suggested-by: Jan Kara <jack@suse.cz>
> >> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> >> >> > ---
> >> >> >> >  fs/ext4/ext4.h      | 2 ++
> >> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >> >> >> >  fs/ext4/super.c     | 4 +++-
> >> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >> >> >> >
> >> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
> >> >> >> > --- a/fs/ext4/ext4.h
> >> >> >> > +++ b/fs/ext4/ext4.h
> >> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >> >> >> >  	 */
> >> >> >> >  	struct work_struct s_sb_upd_work;
> >> >> >> >  
> >> >> >> > +	bool s_journal_destorying;
> >> >> >> > +
> >> >> >> >  	/* Atomic write unit values in bytes */
> >> >> >> >  	unsigned int s_awu_min;
> >> >> >> >  	unsigned int s_awu_max;
> >> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> >> >> >> > --- a/fs/ext4/ext4_jbd2.h
> >> >> >> > +++ b/fs/ext4/ext4_jbd2.h
> >> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >> >> >> >  {
> >> >> >> >  	int err = 0;
> >> >> >> >  
> >> >> >> > +	/*
> >> >> >> > +	 * At this point all pending FS updates should be done except a possible
> >> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
> >> >> >> > +	 * than going via journal.
> >> >> >> > +	 */
> >> >> >> > +	sbi->s_journal_destorying = true;
> >> >> >> 
> >> >> >> This is not correct right. I think what we decided to set this flag
> >> >> >> before we flush the workqueue. So that we don't schedule any new
> >> >> >> work after this flag has been set. At least that is what I understood.
> >> >> >> 
> >> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> >> >> >> 
> >> >> >> -ritesh
> >> >> >
> >> >> > Hey Ritesh,
> >> >> >
> >> >> > Yes that is not correct, I missed that in my patch however we realised
> >> >> > that adding it before flush_work() also has issues [1]. More
> >> >> > specifically:
> >> >> 
> >> >> Ohk. right. 
> >> >> 
> >> >> >
> >> >> >                      **kjournald2**
> >> >> >                      jbd2_journal_commit_transaction()
> >> >> >                      ...
> >> >> >                      ext4_handle_error()
> >> >> >                         /* s_journal_destorying is not set */
> >> >> >                         if (journal && !s_journal_destorying)
> >> >> 
> >> >> Then maybe we should not schedule another work to update the superblock
> >> >> via journalling, it the error itself occurred while were trying to
> >> >> commit the journal txn? 
> >> >> 
> >> >> 
> >> >> -ritesh
> >> >
> >> > Hmm, ideally yes that should not happen, but how can we achieve that?
> >> > For example with the trace we saw:
> >> >
> >> >    **kjournald2**
> >> >    jbd2_journal_commit_transaction()
> >> >      jbd2_journal_get_descriptor_buffer
> >> >        jbd2_journal_bmap
> >> >          ext4_journal_bmap
> >> >            ext4_map_blocks
> >> >              ...
> >> >              ext4_inode_error
> >> >                ext4_handle_error
> >> >                  schedule_work(&sbi->s_sb_upd_work)
> >> >
> >> > How do we tell ext4_handle_error that it is in the context of a
> >> > committing txn.

So I was thinking about this. It is not a problem to determine we are
running in kjournald context - it is enough to check

	current == EXT4_SB(sb)->s_journal->j_task

But I'm not sure checking this in ext4_handle_error() and doing direct sb
update instead of scheduling a journalled one is always correct. For
example kjournald does also writeback of ordered data and if that hits an
error, we do not necessarily abort the journal (well, currently we do as
far as I'm checking but it seems a bit fragile to rely on this).

So I'd rather keep the solution for these umount issues specific to the
umount path. What if we did:

static void ext4_journal_destroy(struct super_block *sb)
{
	/*
	 * At this point only two things can be operating on the journal.
	 * JBD2 thread performing transaction commit and s_sb_upd_work
	 * issuing sb update through the journal. Once we set
	 * EXT4_FLAGS_JOURNAL_DESTROY, new ext4_handle_error() calls will not
	 * queue s_sb_upd_work and ext4_force_commit() makes sure any
	 * ext4_handle_error() calls from the running transaction commit are
	 * finished. Hence no new s_sb_upd_work can be queued after we
	 * flush it here.
	 */
	set_bit(EXT4_FLAGS_JOURNAL_DESTROY, &EXT4_SB(sb)->s_ext4_flags);
	ext4_force_commit(sb);
	flush_work(&EXT4_SB(sb)->s_sb_upd_work);
	jbd2_journal_destroy(EXT4_SB(sb)->s_journal);
}

And then add the check to ext4_handle_error():

                /*
                 * In case the fs should keep running, we need to writeout
                 * superblock through the journal. Due to lock ordering
                 * constraints, it may not be safe to do it right here so we
-		 * defer superblock flushing to a workqueue.
+		 * defer superblock flushing to a workqueue. We just need
+		 * to be careful when the journal is already shutting down.
+		 * If we get here in that case, just update the sb directly
+		 * as the last transaction won't commit anyway.
                 */
-		if (continue_fs && journal)
+		if (continue_fs && journal &&
+		    !test_bit(EXT4_FLAGS_JOURNAL_DESTROY,
+			      &EXT4_SB(sb)->s_ext4_flags))
			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
		else
			ext4_commit_super(sb);

What do people think about this? Am I missing some possible race?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
                     ` (2 preceding siblings ...)
  2025-03-08  9:55   ` Ritesh Harjani (IBM)
@ 2025-03-12 13:57   ` Eric Sandeen
  2025-03-12 14:27     ` Ojaswin Mujoo
  3 siblings, 1 reply; 41+ messages in thread
From: Eric Sandeen @ 2025-03-12 13:57 UTC (permalink / raw)
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Jan Kara, Baokun Li, linux-kernel, Mahesh Kumar

On 3/6/25 8:28 AM, Ojaswin Mujoo wrote:
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b7d781bfcad..d48e93bd5690 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>  	 */
>  	struct work_struct s_sb_upd_work;
>  
> +	bool s_journal_destorying;

Just a late nitpick and noted that Jan suggested making this a flag,
but just in case, this is a typo: s_journal_destorying -> s_journal_destroying

Thanks,
-Eric

>  	/* Atomic write unit values in bytes */
>  	unsigned int s_awu_min;
>  	unsigned int s_awu_max;


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-12 10:51                 ` Jan Kara
@ 2025-03-12 14:26                   ` Ojaswin Mujoo
  2025-03-12 17:15                     ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-12 14:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, Theodore Ts'o, Baokun Li,
	linux-kernel, Mahesh Kumar

On Wed, Mar 12, 2025 at 11:51:03AM +0100, Jan Kara wrote:
> On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
> > Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
> > >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > >> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> > >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > >> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> > >> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > >> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > >> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > >> >> >> > stress tests, it was observed that in case of some error handling paths, it is
> > >> >> >> > possible for update_super_work to start a transaction after the journal is
> > >> >> >> > destroyed eg:
> > >> >> >> >
> > >> >> >> > (umount)
> > >> >> >> > ext4_kill_sb
> > >> >> >> >   kill_block_super
> > >> >> >> >     generic_shutdown_super
> > >> >> >> >       sync_filesystem /* commits all txns */
> > >> >> >> >       evict_inodes
> > >> >> >> >         /* might start a new txn */
> > >> >> >> >       ext4_put_super
> > >> >> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >> >> >> >         jbd2_journal_destroy
> > >> >> >> >           journal_kill_thread
> > >> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
> > >> >> >> >           jbd2_journal_commit_transaction
> > >> >> >> >             jbd2_journal_get_descriptor_buffer
> > >> >> >> >               jbd2_journal_bmap
> > >> >> >> >                 ext4_journal_bmap
> > >> >> >> >                   ext4_map_blocks
> > >> >> >> >                     ...
> > >> >> >> >                     ext4_inode_error
> > >> >> >> >                       ext4_handle_error
> > >> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
> > >> >> >> >
> > >> >> >> >                                                /* work queue kicks in */
> > >> >> >> >                                                update_super_work
> > >> >> >> >                                                  jbd2_journal_start
> > >> >> >> >                                                    start_this_handle
> > >> >> >> >                                                      BUG_ON(journal->j_flags &
> > >> >> >> >                                                             JBD2_UNMOUNT)
> > >> >> >> >
> > >> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > >> >> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
> > >> >> >> > set. Otherwise, just fallback to an un-journaled commit.
> > >> >> >> >
> > >> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > >> >> >> > during ext4_put_super() (except a running transaction that will get commited
> > >> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > >> >> >> > outside the journal as it won't race with a journaled update (refer
> > >> >> >> > 2d01ddc86606).
> > >> >> >> >
> > >> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > >> >> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
> > >> >> >> >
> > >> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > >> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > >> >> >> > Suggested-by: Jan Kara <jack@suse.cz>
> > >> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > >> >> >> > ---
> > >> >> >> >  fs/ext4/ext4.h      | 2 ++
> > >> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > >> >> >> >  fs/ext4/super.c     | 4 +++-
> > >> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> > >> >> >> >
> > >> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > >> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
> > >> >> >> > --- a/fs/ext4/ext4.h
> > >> >> >> > +++ b/fs/ext4/ext4.h
> > >> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > >> >> >> >  	 */
> > >> >> >> >  	struct work_struct s_sb_upd_work;
> > >> >> >> >  
> > >> >> >> > +	bool s_journal_destorying;
> > >> >> >> > +
> > >> >> >> >  	/* Atomic write unit values in bytes */
> > >> >> >> >  	unsigned int s_awu_min;
> > >> >> >> >  	unsigned int s_awu_max;
> > >> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > >> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> > >> >> >> > --- a/fs/ext4/ext4_jbd2.h
> > >> >> >> > +++ b/fs/ext4/ext4_jbd2.h
> > >> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > >> >> >> >  {
> > >> >> >> >  	int err = 0;
> > >> >> >> >  
> > >> >> >> > +	/*
> > >> >> >> > +	 * At this point all pending FS updates should be done except a possible
> > >> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > >> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
> > >> >> >> > +	 * than going via journal.
> > >> >> >> > +	 */
> > >> >> >> > +	sbi->s_journal_destorying = true;
> > >> >> >> 
> > >> >> >> This is not correct right. I think what we decided to set this flag
> > >> >> >> before we flush the workqueue. So that we don't schedule any new
> > >> >> >> work after this flag has been set. At least that is what I understood.
> > >> >> >> 
> > >> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> > >> >> >> 
> > >> >> >> -ritesh
> > >> >> >
> > >> >> > Hey Ritesh,
> > >> >> >
> > >> >> > Yes that is not correct, I missed that in my patch however we realised
> > >> >> > that adding it before flush_work() also has issues [1]. More
> > >> >> > specifically:
> > >> >> 
> > >> >> Ohk. right. 
> > >> >> 
> > >> >> >
> > >> >> >                      **kjournald2**
> > >> >> >                      jbd2_journal_commit_transaction()
> > >> >> >                      ...
> > >> >> >                      ext4_handle_error()
> > >> >> >                         /* s_journal_destorying is not set */
> > >> >> >                         if (journal && !s_journal_destorying)
> > >> >> 
> > >> >> Then maybe we should not schedule another work to update the superblock
> > >> >> via journalling, it the error itself occurred while were trying to
> > >> >> commit the journal txn? 
> > >> >> 
> > >> >> 
> > >> >> -ritesh
> > >> >
> > >> > Hmm, ideally yes that should not happen, but how can we achieve that?
> > >> > For example with the trace we saw:
> > >> >
> > >> >    **kjournald2**
> > >> >    jbd2_journal_commit_transaction()
> > >> >      jbd2_journal_get_descriptor_buffer
> > >> >        jbd2_journal_bmap
> > >> >          ext4_journal_bmap
> > >> >            ext4_map_blocks
> > >> >              ...
> > >> >              ext4_inode_error
> > >> >                ext4_handle_error
> > >> >                  schedule_work(&sbi->s_sb_upd_work)
> > >> >
> > >> > How do we tell ext4_handle_error that it is in the context of a
> > >> > committing txn.
> 
> So I was thinking about this. It is not a problem to determine we are
> running in kjournald context - it is enough to check
> 
> 	current == EXT4_SB(sb)->s_journal->j_task

Oh, right :) 

> 
> But I'm not sure checking this in ext4_handle_error() and doing direct sb
> update instead of scheduling a journalled one is always correct. For
> example kjournald does also writeback of ordered data and if that hits an
> error, we do not necessarily abort the journal (well, currently we do as
> far as I'm checking but it seems a bit fragile to rely on this).

Okay so IIUC your concern is there might be some codepaths, now or in
the future, where kjournald might call the FS layer, hit an error and
still decide to not abort. In which case we would still want to update
the sb via journal.

Im having some difficulty imagining when this can happen but I'm okay
with the alternate approach of hardening the umount path as well. Fwiw
this approach did feel a bit more cleaner as we could avoid the journal
commit and flushing dance in ext4_journal_destory()

> 
> So I'd rather keep the solution for these umount issues specific to the
> umount path. What if we did:
> 
> static void ext4_journal_destroy(struct super_block *sb)
> {
> 	/*
> 	 * At this point only two things can be operating on the journal.
> 	 * JBD2 thread performing transaction commit and s_sb_upd_work
> 	 * issuing sb update through the journal. Once we set
> 	 * EXT4_FLAGS_JOURNAL_DESTROY, new ext4_handle_error() calls will not
> 	 * queue s_sb_upd_work and ext4_force_commit() makes sure any
> 	 * ext4_handle_error() calls from the running transaction commit are
> 	 * finished. Hence no new s_sb_upd_work can be queued after we
> 	 * flush it here.
> 	 */
> 	set_bit(EXT4_FLAGS_JOURNAL_DESTROY, &EXT4_SB(sb)->s_ext4_flags);

Offtopic, how are s_ext4_flags different from s_mount_flags, since in a
draft patchset for this, I am using:

	ext4_set_mount_flag(sbi->s_sb, EXT4_JBD2_DESTORYING);

so just curious.

> 	ext4_force_commit(sb);
> 	flush_work(&EXT4_SB(sb)->s_sb_upd_work);
> 	jbd2_journal_destroy(EXT4_SB(sb)->s_journal);
> }
> 
> And then add the check to ext4_handle_error():
> 
>                 /*
>                  * In case the fs should keep running, we need to writeout
>                  * superblock through the journal. Due to lock ordering
>                  * constraints, it may not be safe to do it right here so we
> -		 * defer superblock flushing to a workqueue.
> +		 * defer superblock flushing to a workqueue. We just need
> +		 * to be careful when the journal is already shutting down.
> +		 * If we get here in that case, just update the sb directly
> +		 * as the last transaction won't commit anyway.
>                  */
> -		if (continue_fs && journal)
> +		if (continue_fs && journal &&
> +		    !test_bit(EXT4_FLAGS_JOURNAL_DESTROY,
> +			      &EXT4_SB(sb)->s_ext4_flags))
> 			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> 		else
> 			ext4_commit_super(sb);
> 
> What do people think about this? Am I missing some possible race?

This approach looks good Jan, I think its similar to what we were
discussing previous to the dont-schedule-if-kjournald alternative.

I was working on a similar patchset as this, I'll borrow your wording
for the comments and run some more tests on this. Thanks for the review.

Regards,
ojaswin

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-12 13:57   ` Eric Sandeen
@ 2025-03-12 14:27     ` Ojaswin Mujoo
  0 siblings, 0 replies; 41+ messages in thread
From: Ojaswin Mujoo @ 2025-03-12 14:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Baokun Li, linux-kernel,
	Mahesh Kumar

On Wed, Mar 12, 2025 at 08:57:34AM -0500, Eric Sandeen wrote:
> On 3/6/25 8:28 AM, Ojaswin Mujoo wrote:
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 2b7d781bfcad..d48e93bd5690 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >  	 */
> >  	struct work_struct s_sb_upd_work;
> >  
> > +	bool s_journal_destorying;
> 
> Just a late nitpick and noted that Jan suggested making this a flag,
> but just in case, this is a typo: s_journal_destorying -> s_journal_destroying
> 
> Thanks,
> -Eric

Thanks for pointing that Eric. We will be moving to a flag in next
version mostly so we should be okay :)

regards,
ojaswin

> 
> >  	/* Atomic write unit values in bytes */
> >  	unsigned int s_awu_min;
> >  	unsigned int s_awu_max;
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-12 14:26                   ` Ojaswin Mujoo
@ 2025-03-12 17:15                     ` Jan Kara
  2025-03-13  1:20                       ` Zhang Yi
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2025-03-12 17:15 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Ritesh Harjani, linux-ext4, Theodore Ts'o,
	Baokun Li, linux-kernel, Mahesh Kumar

On Wed 12-03-25 19:56:36, Ojaswin Mujoo wrote:
> On Wed, Mar 12, 2025 at 11:51:03AM +0100, Jan Kara wrote:
> > On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
> > > Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > > On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
> > > >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > >> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> > > >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > >> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> > > >> >> >> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > >> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > > >> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > > >> >> >> > stress tests, it was observed that in case of some error handling paths, it is
> > > >> >> >> > possible for update_super_work to start a transaction after the journal is
> > > >> >> >> > destroyed eg:
> > > >> >> >> >
> > > >> >> >> > (umount)
> > > >> >> >> > ext4_kill_sb
> > > >> >> >> >   kill_block_super
> > > >> >> >> >     generic_shutdown_super
> > > >> >> >> >       sync_filesystem /* commits all txns */
> > > >> >> >> >       evict_inodes
> > > >> >> >> >         /* might start a new txn */
> > > >> >> >> >       ext4_put_super
> > > >> >> >> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > >> >> >> >         jbd2_journal_destroy
> > > >> >> >> >           journal_kill_thread
> > > >> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
> > > >> >> >> >           jbd2_journal_commit_transaction
> > > >> >> >> >             jbd2_journal_get_descriptor_buffer
> > > >> >> >> >               jbd2_journal_bmap
> > > >> >> >> >                 ext4_journal_bmap
> > > >> >> >> >                   ext4_map_blocks
> > > >> >> >> >                     ...
> > > >> >> >> >                     ext4_inode_error
> > > >> >> >> >                       ext4_handle_error
> > > >> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
> > > >> >> >> >
> > > >> >> >> >                                                /* work queue kicks in */
> > > >> >> >> >                                                update_super_work
> > > >> >> >> >                                                  jbd2_journal_start
> > > >> >> >> >                                                    start_this_handle
> > > >> >> >> >                                                      BUG_ON(journal->j_flags &
> > > >> >> >> >                                                             JBD2_UNMOUNT)
> > > >> >> >> >
> > > >> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > > >> >> >> > destroying only do a journaled (and deferred) update of sb if this flag is not
> > > >> >> >> > set. Otherwise, just fallback to an un-journaled commit.
> > > >> >> >> >
> > > >> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > > >> >> >> > during ext4_put_super() (except a running transaction that will get commited
> > > >> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > > >> >> >> > outside the journal as it won't race with a journaled update (refer
> > > >> >> >> > 2d01ddc86606).
> > > >> >> >> >
> > > >> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > > >> >> >> > called from mballoc and AFAICT it would be always valid to schedule work here.
> > > >> >> >> >
> > > >> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > >> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > >> >> >> > Suggested-by: Jan Kara <jack@suse.cz>
> > > >> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > >> >> >> > ---
> > > >> >> >> >  fs/ext4/ext4.h      | 2 ++
> > > >> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > > >> >> >> >  fs/ext4/super.c     | 4 +++-
> > > >> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> > > >> >> >> >
> > > >> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > >> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
> > > >> >> >> > --- a/fs/ext4/ext4.h
> > > >> >> >> > +++ b/fs/ext4/ext4.h
> > > >> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > > >> >> >> >  	 */
> > > >> >> >> >  	struct work_struct s_sb_upd_work;
> > > >> >> >> >  
> > > >> >> >> > +	bool s_journal_destorying;
> > > >> >> >> > +
> > > >> >> >> >  	/* Atomic write unit values in bytes */
> > > >> >> >> >  	unsigned int s_awu_min;
> > > >> >> >> >  	unsigned int s_awu_max;
> > > >> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > >> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> > > >> >> >> > --- a/fs/ext4/ext4_jbd2.h
> > > >> >> >> > +++ b/fs/ext4/ext4_jbd2.h
> > > >> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > > >> >> >> >  {
> > > >> >> >> >  	int err = 0;
> > > >> >> >> >  
> > > >> >> >> > +	/*
> > > >> >> >> > +	 * At this point all pending FS updates should be done except a possible
> > > >> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > > >> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
> > > >> >> >> > +	 * than going via journal.
> > > >> >> >> > +	 */
> > > >> >> >> > +	sbi->s_journal_destorying = true;
> > > >> >> >> 
> > > >> >> >> This is not correct right. I think what we decided to set this flag
> > > >> >> >> before we flush the workqueue. So that we don't schedule any new
> > > >> >> >> work after this flag has been set. At least that is what I understood.
> > > >> >> >> 
> > > >> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> > > >> >> >> 
> > > >> >> >> -ritesh
> > > >> >> >
> > > >> >> > Hey Ritesh,
> > > >> >> >
> > > >> >> > Yes that is not correct, I missed that in my patch however we realised
> > > >> >> > that adding it before flush_work() also has issues [1]. More
> > > >> >> > specifically:
> > > >> >> 
> > > >> >> Ohk. right. 
> > > >> >> 
> > > >> >> >
> > > >> >> >                      **kjournald2**
> > > >> >> >                      jbd2_journal_commit_transaction()
> > > >> >> >                      ...
> > > >> >> >                      ext4_handle_error()
> > > >> >> >                         /* s_journal_destorying is not set */
> > > >> >> >                         if (journal && !s_journal_destorying)
> > > >> >> 
> > > >> >> Then maybe we should not schedule another work to update the superblock
> > > >> >> via journalling, it the error itself occurred while were trying to
> > > >> >> commit the journal txn? 
> > > >> >> 
> > > >> >> 
> > > >> >> -ritesh
> > > >> >
> > > >> > Hmm, ideally yes that should not happen, but how can we achieve that?
> > > >> > For example with the trace we saw:
> > > >> >
> > > >> >    **kjournald2**
> > > >> >    jbd2_journal_commit_transaction()
> > > >> >      jbd2_journal_get_descriptor_buffer
> > > >> >        jbd2_journal_bmap
> > > >> >          ext4_journal_bmap
> > > >> >            ext4_map_blocks
> > > >> >              ...
> > > >> >              ext4_inode_error
> > > >> >                ext4_handle_error
> > > >> >                  schedule_work(&sbi->s_sb_upd_work)
> > > >> >
> > > >> > How do we tell ext4_handle_error that it is in the context of a
> > > >> > committing txn.
> > 
> > So I was thinking about this. It is not a problem to determine we are
> > running in kjournald context - it is enough to check
> > 
> > 	current == EXT4_SB(sb)->s_journal->j_task
> 
> Oh, right :) 
> 
> > 
> > But I'm not sure checking this in ext4_handle_error() and doing direct sb
> > update instead of scheduling a journalled one is always correct. For
> > example kjournald does also writeback of ordered data and if that hits an
> > error, we do not necessarily abort the journal (well, currently we do as
> > far as I'm checking but it seems a bit fragile to rely on this).
> 
> Okay so IIUC your concern is there might be some codepaths, now or in
> the future, where kjournald might call the FS layer, hit an error and
> still decide to not abort. In which case we would still want to update
> the sb via journal.

Yeah. The reason why I'm a bit concerned about it is mostly the case of
kjournald also handling ordered data and situations like
!(journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR) where people want to
continue although ordered data had issues. Or situations where something in
j_commit_callback or another jbd2 hook ends up calling ext4_error()...

> > static void ext4_journal_destroy(struct super_block *sb)
> > {
> > 	/*
> > 	 * At this point only two things can be operating on the journal.
> > 	 * JBD2 thread performing transaction commit and s_sb_upd_work
> > 	 * issuing sb update through the journal. Once we set
> > 	 * EXT4_FLAGS_JOURNAL_DESTROY, new ext4_handle_error() calls will not
> > 	 * queue s_sb_upd_work and ext4_force_commit() makes sure any
> > 	 * ext4_handle_error() calls from the running transaction commit are
> > 	 * finished. Hence no new s_sb_upd_work can be queued after we
> > 	 * flush it here.
> > 	 */
> > 	set_bit(EXT4_FLAGS_JOURNAL_DESTROY, &EXT4_SB(sb)->s_ext4_flags);
> 
> Offtopic, how are s_ext4_flags different from s_mount_flags, since in a
> draft patchset for this, I am using:
> 
> 	ext4_set_mount_flag(sbi->s_sb, EXT4_JBD2_DESTORYING);
> 
> so just curious.

I don't think there's a difference and I think we can unify them. For now
pick whatever you like :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-12 17:15                     ` Jan Kara
@ 2025-03-13  1:20                       ` Zhang Yi
  2025-03-13  2:08                         ` Baokun Li
  0 siblings, 1 reply; 41+ messages in thread
From: Zhang Yi @ 2025-03-13  1:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, linux-ext4, Theodore Ts'o, Baokun Li,
	linux-kernel, Mahesh Kumar, Ojaswin Mujoo

On 2025/3/13 1:15, Jan Kara wrote:
> On Wed 12-03-25 19:56:36, Ojaswin Mujoo wrote:
>> On Wed, Mar 12, 2025 at 11:51:03AM +0100, Jan Kara wrote:
>>> On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>> On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>> On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
>>>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>>>> On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
>>>>>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>>>>> destroyed eg:
>>>>>>>>>>>
>>>>>>>>>>> (umount)
>>>>>>>>>>> ext4_kill_sb
>>>>>>>>>>>   kill_block_super
>>>>>>>>>>>     generic_shutdown_super
>>>>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>>>>       evict_inodes
>>>>>>>>>>>         /* might start a new txn */
>>>>>>>>>>>       ext4_put_super
>>>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>>>         jbd2_journal_destroy
>>>>>>>>>>>           journal_kill_thread
>>>>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>>>>               jbd2_journal_bmap
>>>>>>>>>>>                 ext4_journal_bmap
>>>>>>>>>>>                   ext4_map_blocks
>>>>>>>>>>>                     ...
>>>>>>>>>>>                     ext4_inode_error
>>>>>>>>>>>                       ext4_handle_error
>>>>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>>>
>>>>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>>>>                                                update_super_work
>>>>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>>>>                                                    start_this_handle
>>>>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>>>>
>>>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>>>>
>>>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>>>> 2d01ddc86606).
>>>>>>>>>>>
>>>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>>>  	 */
>>>>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>>>>  
>>>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>>>> +
>>>>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>>>  {
>>>>>>>>>>>  	int err = 0;
>>>>>>>>>>>  
>>>>>>>>>>> +	/*
>>>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>>>> +	 * than going via journal.
>>>>>>>>>>> +	 */
>>>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>>>
>>>>>>>>>> This is not correct right. I think what we decided to set this flag
>>>>>>>>>> before we flush the workqueue. So that we don't schedule any new
>>>>>>>>>> work after this flag has been set. At least that is what I understood.
>>>>>>>>>>
>>>>>>>>>> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
>>>>>>>>>>
>>>>>>>>>> -ritesh
>>>>>>>>>
>>>>>>>>> Hey Ritesh,
>>>>>>>>>
>>>>>>>>> Yes that is not correct, I missed that in my patch however we realised
>>>>>>>>> that adding it before flush_work() also has issues [1]. More
>>>>>>>>> specifically:
>>>>>>>>
>>>>>>>> Ohk. right. 
>>>>>>>>
>>>>>>>>>
>>>>>>>>>                      **kjournald2**
>>>>>>>>>                      jbd2_journal_commit_transaction()
>>>>>>>>>                      ...
>>>>>>>>>                      ext4_handle_error()
>>>>>>>>>                         /* s_journal_destorying is not set */
>>>>>>>>>                         if (journal && !s_journal_destorying)
>>>>>>>>
>>>>>>>> Then maybe we should not schedule another work to update the superblock
>>>>>>>> via journalling, it the error itself occurred while were trying to
>>>>>>>> commit the journal txn? 
>>>>>>>>
>>>>>>>>
>>>>>>>> -ritesh
>>>>>>>
>>>>>>> Hmm, ideally yes that should not happen, but how can we achieve that?
>>>>>>> For example with the trace we saw:
>>>>>>>
>>>>>>>    **kjournald2**
>>>>>>>    jbd2_journal_commit_transaction()
>>>>>>>      jbd2_journal_get_descriptor_buffer
>>>>>>>        jbd2_journal_bmap
>>>>>>>          ext4_journal_bmap
>>>>>>>            ext4_map_blocks
>>>>>>>              ...
>>>>>>>              ext4_inode_error
>>>>>>>                ext4_handle_error
>>>>>>>                  schedule_work(&sbi->s_sb_upd_work)
>>>>>>>
>>>>>>> How do we tell ext4_handle_error that it is in the context of a
>>>>>>> committing txn.
>>>
>>> So I was thinking about this. It is not a problem to determine we are
>>> running in kjournald context - it is enough to check
>>>
>>> 	current == EXT4_SB(sb)->s_journal->j_task
>>
>> Oh, right :) 
>>
>>>
>>> But I'm not sure checking this in ext4_handle_error() and doing direct sb
>>> update instead of scheduling a journalled one is always correct. For
>>> example kjournald does also writeback of ordered data and if that hits an
>>> error, we do not necessarily abort the journal (well, currently we do as
>>> far as I'm checking but it seems a bit fragile to rely on this).
>>
>> Okay so IIUC your concern is there might be some codepaths, now or in
>> the future, where kjournald might call the FS layer, hit an error and
>> still decide to not abort. In which case we would still want to update
>> the sb via journal.
> 
> Yeah. The reason why I'm a bit concerned about it is mostly the case of
> kjournald also handling ordered data and situations like
> !(journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR) where people want to
> continue although ordered data had issues. Or situations where something in
> j_commit_callback or another jbd2 hook ends up calling ext4_error()...
> 

Ha, right! This is a case where kjournald triggers an ext4 error but does
not abort the journal for now, I forgot this one, and there may be more.
Thanks for pointing it out. I would also prefer to use this solution of
adding ext4_journal_destory().

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
  2025-03-13  1:20                       ` Zhang Yi
@ 2025-03-13  2:08                         ` Baokun Li
  0 siblings, 0 replies; 41+ messages in thread
From: Baokun Li @ 2025-03-13  2:08 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi, Jan Kara
  Cc: Ritesh Harjani, linux-ext4, Theodore Ts'o, linux-kernel,
	Mahesh Kumar, Yang Erkun

On 2025/3/13 9:20, Zhang Yi wrote:
> On 2025/3/13 1:15, Jan Kara wrote:
>> On Wed 12-03-25 19:56:36, Ojaswin Mujoo wrote:
>>> On Wed, Mar 12, 2025 at 11:51:03AM +0100, Jan Kara wrote:
>>>> On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>> On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
>>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>>> On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
>>>>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>>>>> On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
>>>>>>>>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>>>>>> destroyed eg:
>>>>>>>>>>>>
>>>>>>>>>>>> (umount)
>>>>>>>>>>>> ext4_kill_sb
>>>>>>>>>>>>    kill_block_super
>>>>>>>>>>>>      generic_shutdown_super
>>>>>>>>>>>>        sync_filesystem /* commits all txns */
>>>>>>>>>>>>        evict_inodes
>>>>>>>>>>>>          /* might start a new txn */
>>>>>>>>>>>>        ext4_put_super
>>>>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>>>>          jbd2_journal_destroy
>>>>>>>>>>>>            journal_kill_thread
>>>>>>>>>>>>              journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>>>>            jbd2_journal_commit_transaction
>>>>>>>>>>>>              jbd2_journal_get_descriptor_buffer
>>>>>>>>>>>>                jbd2_journal_bmap
>>>>>>>>>>>>                  ext4_journal_bmap
>>>>>>>>>>>>                    ext4_map_blocks
>>>>>>>>>>>>                      ...
>>>>>>>>>>>>                      ext4_inode_error
>>>>>>>>>>>>                        ext4_handle_error
>>>>>>>>>>>>                          schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>>>>
>>>>>>>>>>>>                                                 /* work queue kicks in */
>>>>>>>>>>>>                                                 update_super_work
>>>>>>>>>>>>                                                   jbd2_journal_start
>>>>>>>>>>>>                                                     start_this_handle
>>>>>>>>>>>>                                                       BUG_ON(journal->j_flags &
>>>>>>>>>>>>                                                              JBD2_UNMOUNT)
>>>>>>>>>>>>
>>>>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>>>>>
>>>>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>>>>> 2d01ddc86606).
>>>>>>>>>>>>
>>>>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   fs/ext4/ext4.h      | 2 ++
>>>>>>>>>>>>   fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>>>>   fs/ext4/super.c     | 4 +++-
>>>>>>>>>>>>   3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>>>>   	 */
>>>>>>>>>>>>   	struct work_struct s_sb_upd_work;
>>>>>>>>>>>>   
>>>>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>>>>> +
>>>>>>>>>>>>   	/* Atomic write unit values in bytes */
>>>>>>>>>>>>   	unsigned int s_awu_min;
>>>>>>>>>>>>   	unsigned int s_awu_max;
>>>>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>>>>   {
>>>>>>>>>>>>   	int err = 0;
>>>>>>>>>>>>   
>>>>>>>>>>>> +	/*
>>>>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>>>>> +	 * than going via journal.
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>>>> This is not correct right. I think what we decided to set this flag
>>>>>>>>>>> before we flush the workqueue. So that we don't schedule any new
>>>>>>>>>>> work after this flag has been set. At least that is what I understood.
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
>>>>>>>>>>>
>>>>>>>>>>> -ritesh
>>>>>>>>>> Hey Ritesh,
>>>>>>>>>>
>>>>>>>>>> Yes that is not correct, I missed that in my patch however we realised
>>>>>>>>>> that adding it before flush_work() also has issues [1]. More
>>>>>>>>>> specifically:
>>>>>>>>> Ohk. right.
>>>>>>>>>
>>>>>>>>>>                       **kjournald2**
>>>>>>>>>>                       jbd2_journal_commit_transaction()
>>>>>>>>>>                       ...
>>>>>>>>>>                       ext4_handle_error()
>>>>>>>>>>                          /* s_journal_destorying is not set */
>>>>>>>>>>                          if (journal && !s_journal_destorying)
>>>>>>>>> Then maybe we should not schedule another work to update the superblock
>>>>>>>>> via journalling, it the error itself occurred while were trying to
>>>>>>>>> commit the journal txn?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -ritesh
>>>>>>>> Hmm, ideally yes that should not happen, but how can we achieve that?
>>>>>>>> For example with the trace we saw:
>>>>>>>>
>>>>>>>>     **kjournald2**
>>>>>>>>     jbd2_journal_commit_transaction()
>>>>>>>>       jbd2_journal_get_descriptor_buffer
>>>>>>>>         jbd2_journal_bmap
>>>>>>>>           ext4_journal_bmap
>>>>>>>>             ext4_map_blocks
>>>>>>>>               ...
>>>>>>>>               ext4_inode_error
>>>>>>>>                 ext4_handle_error
>>>>>>>>                   schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>> How do we tell ext4_handle_error that it is in the context of a
>>>>>>>> committing txn.
>>>> So I was thinking about this. It is not a problem to determine we are
>>>> running in kjournald context - it is enough to check
>>>>
>>>> 	current == EXT4_SB(sb)->s_journal->j_task
>>> Oh, right :)
>>>
>>>> But I'm not sure checking this in ext4_handle_error() and doing direct sb
>>>> update instead of scheduling a journalled one is always correct. For
>>>> example kjournald does also writeback of ordered data and if that hits an
>>>> error, we do not necessarily abort the journal (well, currently we do as
>>>> far as I'm checking but it seems a bit fragile to rely on this).
>>> Okay so IIUC your concern is there might be some codepaths, now or in
>>> the future, where kjournald might call the FS layer, hit an error and
>>> still decide to not abort. In which case we would still want to update
>>> the sb via journal.
>> Yeah. The reason why I'm a bit concerned about it is mostly the case of
>> kjournald also handling ordered data and situations like
>> !(journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR) where people want to
>> continue although ordered data had issues. Or situations where something in
>> j_commit_callback or another jbd2 hook ends up calling ext4_error()...
>>
> Ha, right! This is a case where kjournald triggers an ext4 error but does
> not abort the journal for now, I forgot this one, and there may be more.
> Thanks for pointing it out. I would also prefer to use this solution of
> adding ext4_journal_destory().
>
If we consider the possibility that there might be calls to ext4_error()
without aborting the journal, although I cannot imagine how this might
happen, this situation may indeed appear in hidden corners now or in the
future. Therefore, an extra flag is indeed needed, with which we don't
have to think so much. 😀


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-03-13  2:08 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
2025-03-07 14:05   ` Jan Kara
2025-03-08  1:18   ` Baokun Li
2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
2025-03-07  2:49   ` Zhang Yi
2025-03-07  6:34     ` Ojaswin Mujoo
2025-03-07  8:13       ` Ojaswin Mujoo
2025-03-07  8:43         ` Zhang Yi
2025-03-07 10:27           ` Ojaswin Mujoo
2025-03-07 12:36             ` Zhang Yi
2025-03-07 17:26               ` Ojaswin Mujoo
2025-03-08  2:57                 ` Zhang Yi
2025-03-08  8:18                   ` Ojaswin Mujoo
2025-03-08  9:58                     ` Ojaswin Mujoo
2025-03-08 10:10                       ` Baokun Li
2025-03-08 12:57                         ` Ojaswin Mujoo
2025-03-08 10:01                   ` Ritesh Harjani
2025-03-07 14:26   ` Jan Kara
2025-03-07 17:00     ` Ojaswin Mujoo
2025-03-08  9:55   ` Ritesh Harjani (IBM)
2025-03-08 13:05     ` Ojaswin Mujoo
2025-03-08 13:26       ` Ritesh Harjani
2025-03-08 14:58         ` Ojaswin Mujoo
2025-03-08 18:41           ` Ritesh Harjani
2025-03-09 12:07             ` Ojaswin Mujoo
2025-03-10  4:43               ` Ritesh Harjani
2025-03-12 10:51                 ` Jan Kara
2025-03-12 14:26                   ` Ojaswin Mujoo
2025-03-12 17:15                     ` Jan Kara
2025-03-13  1:20                       ` Zhang Yi
2025-03-13  2:08                         ` Baokun Li
2025-03-12 13:57   ` Eric Sandeen
2025-03-12 14:27     ` Ojaswin Mujoo
2025-03-06 14:28 ` [PATCH v2 3/3] ext4: Make sb update interval tunable Ojaswin Mujoo
2025-03-08  2:25   ` Baokun Li
2025-03-08  1:09 ` [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Baokun Li
2025-03-08 13:07   ` Ojaswin Mujoo
2025-03-08 15:21     ` Ojaswin Mujoo
2025-03-10  3:10       ` Baokun Li
2025-03-10  7:59         ` Ojaswin Mujoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).