[PATCH 0/2] Fix a BUG_ON crashing the kernel in start_this

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] Fix a BUG_ON crashing the kernel in start_this_handle
@ 2025-02-22  8:40 Ojaswin Mujoo
  2025-02-22  8:40 ` [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE Ojaswin Mujoo
  2025-02-22  8:40 ` [PATCH 2/2] ext4: Make sb update interval tunable Ojaswin Mujoo
  0 siblings, 2 replies; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-22  8:40 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Jan Kara, linux-kernel

When running LTP stress tests on ext4, after a multiday run we seemed to
have hit the following BUG_ON:

 [NIP  : start_this_handle+268]
 #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2]  (unreliable)
 #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
 #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
 #6 [c000001067c27c70] process_one_work at c000000000188790
 #7 [c000001067c27d20] worker_thread at c00000000018973c
 #8 [c000001067c27dc0] kthread at c000000000196c84
 #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64

Which comes out to

  382   repeat:
  383           read_lock(&journal->j_state_lock);
* 384           BUG_ON(journal->j_flags & JBD2_UNMOUNT);
  385           if (is_journal_aborted(journal) ||
  386               (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
  387                   read_unlock(&journal->j_state_lock);


Initially this seemed like it should never happen but upon crash
analysis it seems like it could indeed be hit as described in patch 1/2.

I would like to add that through the logs we only knew that:

- ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
- update_super_work had hit the BUG_ON

I was not able to hit this bug again (without modifying the kernel to
inject errors) but the above backtrace seems to be one possible paths
where this BUG_ON can be hit. Rest of the analysis and fix is in patch
1/2. Patch 2 is just a small tweak that i found helpful while debugging.

That being said, journalling is something I'm not very familiar with and
there might be gaps in my understanding so thoughts and suggestions are
welcome.

Ojaswin Mujoo (2):
  ext4: only defer sb update on error if SB_ACTIVE
  ext4: Make sb update interval tunable

 fs/ext4/ext4.h  |  9 +++++++++
 fs/ext4/super.c | 17 +++++++++--------
 fs/ext4/sysfs.c |  4 ++++
 3 files changed, 22 insertions(+), 8 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-22  8:40 [PATCH 0/2] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
@ 2025-02-22  8:40 ` Ojaswin Mujoo
  2025-02-24 14:52   ` Jan Kara
  2025-02-25  1:53   ` Baokun Li
  2025-02-22  8:40 ` [PATCH 2/2] ext4: Make sb update interval tunable Ojaswin Mujoo
  1 sibling, 2 replies; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-22  8:40 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Jan Kara, linux-kernel, Mahesh Kumar, Ritesh Harjani

Presently we always BUG_ON if trying to start a transaction on a journal
marked with JBD2_UNMOUNT, since this should never happen. However while
running stress tests it was observed that in case of some error handling
paths, it is possible for update_super_work to start a transaction after
the journal is destroyed eg:

(umount)
ext4_kill_sb
  kill_block_super
    generic_shutdown_super
      sync_filesystem /* commits all txns */
      evict_inodes
        /* might start a new txn */
      ext4_put_super
	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
        jbd2_journal_destroy
          journal_kill_thread
            journal->j_flags |= JBD2_UNMOUNT;
          jbd2_journal_commit_transaction
            jbd2_journal_get_descriptor_buffer
              jbd2_journal_bmap
                ext4_journal_bmap
                  ext4_map_blocks
                    ...
                    ext4_inode_error
                      ext4_handle_error
                        schedule_work(&sbi->s_sb_upd_work)

                                               /* work queue kicks in */
                                               update_super_work
                                                 jbd2_journal_start
                                                   start_this_handle
                                                     BUG_ON(journal->j_flags &
                                                            JBD2_UNMOUNT)

Hence, make sure we only defer the update of ext4 sb if the sb is still
active.  Otherwise, just fallback to an un-journaled commit.

The important thing to note here is that we must only defer sb update if
we have not yet flushed the s_sb_update_work queue in umount path else
this race can be hit (point 1 below). Since we don't have a direct way
to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
subtle so adding some notes below for future reference:

1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
== 0) however this is not correct since we could end up scheduling work
after it has been flushed:

 ext4_put_super
  flush_work(&sbi->s_sb_upd_work)

                           **kjournald2**
                           jbd2_journal_commit_transaction
                           ...
                           ext4_inode_error
                             /* JBD2_UNMOUNT not set */
                             schedule_work(s_sb_upd_work)

   jbd2_journal_destroy
    journal->j_flags |= JBD2_UNMOUNT;

                                      **workqueue**
                                      update_super_work
                                       jbd2_journal_start
                                        start_this_handle
                                          BUG_ON(JBD2_UNMOUNT)

Something like the above doesn't happen with SB_ACTIVE check because we
are sure that the workqueue would be flushed at a later point if we are
in the umount path.

2. We don't need a similar check in ext4_grp_locked_error since it is
only called from mballoc and AFAICT it would be always valid to schedule
work here.

Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a963ffda692a..b7341e9acf62 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 		 * constraints, it may not be safe to do it right here so we
 		 * defer superblock flushing to a workqueue.
 		 */
-		if (continue_fs && journal)
+		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
 			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 		else
 			ext4_commit_super(sb);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] ext4: Make sb update interval tunable
  2025-02-22  8:40 [PATCH 0/2] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
  2025-02-22  8:40 ` [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE Ojaswin Mujoo
@ 2025-02-22  8:40 ` Ojaswin Mujoo
  2025-02-22  9:32   ` Ritesh Harjani (IBM)
  2025-02-24 14:27   ` Jan Kara
  1 sibling, 2 replies; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-22  8:40 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Jan Kara, linux-kernel

Currently, outside error paths, we auto commit the super block after 1
hour has passed and 16MB worth of updates have been written since last
commit. This is a policy decision so make this tunable while keeping the
defaults same. This is useful if user wants to tweak the superblock
update behavior or for debugging the codepath by allowing to trigger it
more frequently.

We can now tweak the super block update using sb_update_sec and
sb_update_kb files in /sys/fs/ext4/<dev>/

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h  |  9 +++++++++
 fs/ext4/super.c | 15 ++++++++-------
 fs/ext4/sysfs.c |  4 ++++
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2b7d781bfcad..306e5db7129c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1608,6 +1608,8 @@ struct ext4_sb_info {
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
+	unsigned int s_sb_update_sec;
+	unsigned int s_sb_update_kb;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -2279,6 +2281,13 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
 #define EXT4_DEF_MIN_BATCH_TIME	0
 #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
 
+/*
+ * Default values for superblock update
+ */
+#define EXT4_DEF_SB_UPDATE_INTERVAL_SEC (3600) /* seconds (1 hour) */
+#define EXT4_DEF_SB_UPDATE_INTERVAL_KB (16384) /* kilobytes (16MB) */
+
+
 /*
  * Minimum number of groups in a flexgroup before we separate out
  * directories into the first block group of a flexgroup
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b7341e9acf62..129d666f450b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -447,9 +447,6 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
 #define ext4_get_tstamp(es, tstamp) \
 	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
 
-#define EXT4_SB_REFRESH_INTERVAL_SEC (3600) /* seconds (1 hour) */
-#define EXT4_SB_REFRESH_INTERVAL_KB (16384) /* kilobytes (16MB) */
-
 /*
  * The ext4_maybe_update_superblock() function checks and updates the
  * superblock if needed.
@@ -457,8 +454,10 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
  * This function is designed to update the on-disk superblock only under
  * certain conditions to prevent excessive disk writes and unnecessary
  * waking of the disk from sleep. The superblock will be updated if:
- * 1. More than an hour has passed since the last superblock update, and
- * 2. More than 16MB have been written since the last superblock update.
+ * 1. More than sbi->s_sb_update_sec (def: 1 hour) has passed since the last
+ *    superblock update
+ * 2. More than sbi->s_sb_update_kb (def: 16MB) kbs have been written since the
+ *    last superblock update.
  *
  * @sb: The superblock
  */
@@ -479,7 +478,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
 	now = ktime_get_real_seconds();
 	last_update = ext4_get_tstamp(es, s_wtime);
 
-	if (likely(now - last_update < EXT4_SB_REFRESH_INTERVAL_SEC))
+	if (likely(now - last_update < sbi->s_sb_update_sec))
 		return;
 
 	lifetime_write_kbytes = sbi->s_kbytes_written +
@@ -494,7 +493,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
 	 */
 	diff_size = lifetime_write_kbytes - le64_to_cpu(es->s_kbytes_written);
 
-	if (diff_size > EXT4_SB_REFRESH_INTERVAL_KB)
+	if (diff_size > sbi->s_sb_update_kb)
 		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 }
 
@@ -5248,6 +5247,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
+	sbi->s_sb_update_kb = EXT4_DEF_SB_UPDATE_INTERVAL_KB;
+	sbi->s_sb_update_sec = EXT4_DEF_SB_UPDATE_INTERVAL_SEC;
 
 	/*
 	 * set default s_li_wait_mult for lazyinit, for the case there is
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index ddb54608ca2e..987bd00f916a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -254,6 +254,8 @@ EXT4_ATTR(journal_task, 0444, journal_task);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
+EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
+EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -305,6 +307,8 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(mb_prefetch),
 	ATTR_LIST(mb_prefetch_limit),
 	ATTR_LIST(last_trim_minblks),
+	ATTR_LIST(sb_update_sec),
+	ATTR_LIST(sb_update_kb),
 	NULL,
 };
 ATTRIBUTE_GROUPS(ext4);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] ext4: Make sb update interval tunable
  2025-02-22  8:40 ` [PATCH 2/2] ext4: Make sb update interval tunable Ojaswin Mujoo
@ 2025-02-22  9:32   ` Ritesh Harjani (IBM)
  2025-02-24 14:27   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Ritesh Harjani (IBM) @ 2025-02-22  9:32 UTC (permalink / raw)
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o; +Cc: Jan Kara, linux-kernel

Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> Currently, outside error paths, we auto commit the super block after 1
> hour has passed and 16MB worth of updates have been written since last
> commit. This is a policy decision so make this tunable while keeping the
> defaults same. This is useful if user wants to tweak the superblock
> update behavior or for debugging the codepath by allowing to trigger it
> more frequently.
>
> We can now tweak the super block update using sb_update_sec and
> sb_update_kb files in /sys/fs/ext4/<dev>/


Agree that this could be useful as a tunable knob for various reasons
rather than being a hardcoded value within kernel. 

The patch also looks good to me. Please feel free to add -

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] ext4: Make sb update interval tunable
  2025-02-22  8:40 ` [PATCH 2/2] ext4: Make sb update interval tunable Ojaswin Mujoo
  2025-02-22  9:32   ` Ritesh Harjani (IBM)
@ 2025-02-24 14:27   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2025-02-24 14:27 UTC (permalink / raw)
  To: Ojaswin Mujoo; +Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel

On Sat 22-02-25 14:10:23, Ojaswin Mujoo wrote:
> Currently, outside error paths, we auto commit the super block after 1
> hour has passed and 16MB worth of updates have been written since last
> commit. This is a policy decision so make this tunable while keeping the
> defaults same. This is useful if user wants to tweak the superblock
> update behavior or for debugging the codepath by allowing to trigger it
> more frequently.
> 
> We can now tweak the super block update using sb_update_sec and
> sb_update_kb files in /sys/fs/ext4/<dev>/
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Looks sensible. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4.h  |  9 +++++++++
>  fs/ext4/super.c | 15 ++++++++-------
>  fs/ext4/sysfs.c |  4 ++++
>  3 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2b7d781bfcad..306e5db7129c 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1608,6 +1608,8 @@ struct ext4_sb_info {
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> +	unsigned int s_sb_update_sec;
> +	unsigned int s_sb_update_kb;
>  
>  	/* stats for buddy allocator */
>  	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
> @@ -2279,6 +2281,13 @@ static inline int ext4_forced_shutdown(struct super_block *sb)
>  #define EXT4_DEF_MIN_BATCH_TIME	0
>  #define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
>  
> +/*
> + * Default values for superblock update
> + */
> +#define EXT4_DEF_SB_UPDATE_INTERVAL_SEC (3600) /* seconds (1 hour) */
> +#define EXT4_DEF_SB_UPDATE_INTERVAL_KB (16384) /* kilobytes (16MB) */
> +
> +
>  /*
>   * Minimum number of groups in a flexgroup before we separate out
>   * directories into the first block group of a flexgroup
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b7341e9acf62..129d666f450b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -447,9 +447,6 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
>  #define ext4_get_tstamp(es, tstamp) \
>  	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
>  
> -#define EXT4_SB_REFRESH_INTERVAL_SEC (3600) /* seconds (1 hour) */
> -#define EXT4_SB_REFRESH_INTERVAL_KB (16384) /* kilobytes (16MB) */
> -
>  /*
>   * The ext4_maybe_update_superblock() function checks and updates the
>   * superblock if needed.
> @@ -457,8 +454,10 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
>   * This function is designed to update the on-disk superblock only under
>   * certain conditions to prevent excessive disk writes and unnecessary
>   * waking of the disk from sleep. The superblock will be updated if:
> - * 1. More than an hour has passed since the last superblock update, and
> - * 2. More than 16MB have been written since the last superblock update.
> + * 1. More than sbi->s_sb_update_sec (def: 1 hour) has passed since the last
> + *    superblock update
> + * 2. More than sbi->s_sb_update_kb (def: 16MB) kbs have been written since the
> + *    last superblock update.
>   *
>   * @sb: The superblock
>   */
> @@ -479,7 +478,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
>  	now = ktime_get_real_seconds();
>  	last_update = ext4_get_tstamp(es, s_wtime);
>  
> -	if (likely(now - last_update < EXT4_SB_REFRESH_INTERVAL_SEC))
> +	if (likely(now - last_update < sbi->s_sb_update_sec))
>  		return;
>  
>  	lifetime_write_kbytes = sbi->s_kbytes_written +
> @@ -494,7 +493,7 @@ static void ext4_maybe_update_superblock(struct super_block *sb)
>  	 */
>  	diff_size = lifetime_write_kbytes - le64_to_cpu(es->s_kbytes_written);
>  
> -	if (diff_size > EXT4_SB_REFRESH_INTERVAL_KB)
> +	if (diff_size > sbi->s_sb_update_kb)
>  		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>  }
>  
> @@ -5248,6 +5247,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
>  	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
>  	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
> +	sbi->s_sb_update_kb = EXT4_DEF_SB_UPDATE_INTERVAL_KB;
> +	sbi->s_sb_update_sec = EXT4_DEF_SB_UPDATE_INTERVAL_SEC;
>  
>  	/*
>  	 * set default s_li_wait_mult for lazyinit, for the case there is
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index ddb54608ca2e..987bd00f916a 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -254,6 +254,8 @@ EXT4_ATTR(journal_task, 0444, journal_task);
>  EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
>  EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
>  EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
> +EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
> +EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
>  
>  static unsigned int old_bump_val = 128;
>  EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> @@ -305,6 +307,8 @@ static struct attribute *ext4_attrs[] = {
>  	ATTR_LIST(mb_prefetch),
>  	ATTR_LIST(mb_prefetch_limit),
>  	ATTR_LIST(last_trim_minblks),
> +	ATTR_LIST(sb_update_sec),
> +	ATTR_LIST(sb_update_kb),
>  	NULL,
>  };
>  ATTRIBUTE_GROUPS(ext4);
> -- 
> 2.48.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-22  8:40 ` [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE Ojaswin Mujoo
@ 2025-02-24 14:52   ` Jan Kara
  2025-02-27 11:20     ` Ojaswin Mujoo
  2025-02-25  1:53   ` Baokun Li
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-02-24 14:52 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani

On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> Presently we always BUG_ON if trying to start a transaction on a journal
> marked with JBD2_UNMOUNT, since this should never happen. However while
> running stress tests it was observed that in case of some error handling
> paths, it is possible for update_super_work to start a transaction after
> the journal is destroyed eg:
> 
> (umount)
> ext4_kill_sb
>   kill_block_super
>     generic_shutdown_super
>       sync_filesystem /* commits all txns */
>       evict_inodes
>         /* might start a new txn */
>       ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>         jbd2_journal_destroy
>           journal_kill_thread
>             journal->j_flags |= JBD2_UNMOUNT;
>           jbd2_journal_commit_transaction
>             jbd2_journal_get_descriptor_buffer
>               jbd2_journal_bmap
>                 ext4_journal_bmap
>                   ext4_map_blocks
>                     ...
>                     ext4_inode_error
>                       ext4_handle_error
>                         schedule_work(&sbi->s_sb_upd_work)
> 
>                                                /* work queue kicks in */
>                                                update_super_work
>                                                  jbd2_journal_start
>                                                    start_this_handle
>                                                      BUG_ON(journal->j_flags &
>                                                             JBD2_UNMOUNT)
> 
> Hence, make sure we only defer the update of ext4 sb if the sb is still
> active.  Otherwise, just fallback to an un-journaled commit.
> 
> The important thing to note here is that we must only defer sb update if
> we have not yet flushed the s_sb_update_work queue in umount path else
> this race can be hit (point 1 below). Since we don't have a direct way
> to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> subtle so adding some notes below for future reference:
> 
> 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> == 0) however this is not correct since we could end up scheduling work
> after it has been flushed:
> 
>  ext4_put_super
>   flush_work(&sbi->s_sb_upd_work)
> 
>                            **kjournald2**
>                            jbd2_journal_commit_transaction
>                            ...
>                            ext4_inode_error
>                              /* JBD2_UNMOUNT not set */
>                              schedule_work(s_sb_upd_work)
> 
>    jbd2_journal_destroy
>     journal->j_flags |= JBD2_UNMOUNT;
> 
>                                       **workqueue**
>                                       update_super_work
>                                        jbd2_journal_start
>                                         start_this_handle
>                                           BUG_ON(JBD2_UNMOUNT)
> 
> Something like the above doesn't happen with SB_ACTIVE check because we
> are sure that the workqueue would be flushed at a later point if we are
> in the umount path.
> 
> 2. We don't need a similar check in ext4_grp_locked_error since it is
> only called from mballoc and AFAICT it would be always valid to schedule
> work here.
> 
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Good catch! But I think the solution will have to be slightly different.
Basing the check on SB_ACTIVE has the problem that you can have racing
updates of the sb in the still running transaction and in your direct
update leading to inconsistencies after a crash (that was the reason why
we've created the s_sb_upd_work in the first place).

I would solve this by implementing something like
ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
workqueue, and then destroy the journal. And ext4_handle_error() will check
for the sbi flag.

								Honza

> ---
>  fs/ext4/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a963ffda692a..b7341e9acf62 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>  		 * constraints, it may not be safe to do it right here so we
>  		 * defer superblock flushing to a workqueue.
>  		 */
> -		if (continue_fs && journal)
> +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>  		else
>  			ext4_commit_super(sb);
> -- 
> 2.48.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-22  8:40 ` [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE Ojaswin Mujoo
  2025-02-24 14:52   ` Jan Kara
@ 2025-02-25  1:53   ` Baokun Li
  2025-02-25 12:06     ` Ojaswin Mujoo
  1 sibling, 1 reply; 16+ messages in thread
From: Baokun Li @ 2025-02-25  1:53 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On 2025/2/22 16:40, Ojaswin Mujoo wrote:
> Presently we always BUG_ON if trying to start a transaction on a journal
> marked with JBD2_UNMOUNT, since this should never happen. However while
> running stress tests it was observed that in case of some error handling
> paths, it is possible for update_super_work to start a transaction after
> the journal is destroyed eg:
>
> (umount)
> ext4_kill_sb
>    kill_block_super
>      generic_shutdown_super
>        sync_filesystem /* commits all txns */
>        evict_inodes
>          /* might start a new txn */
>        ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>          jbd2_journal_destroy
>            journal_kill_thread
>              journal->j_flags |= JBD2_UNMOUNT;
>            jbd2_journal_commit_transaction
>              jbd2_journal_get_descriptor_buffer
>                jbd2_journal_bmap
>                  ext4_journal_bmap
>                    ext4_map_blocks
>                      ...
>                      ext4_inode_error
Just curious, since jbd2_journal_bmap() only queries the map and does not
create it, how does it fail here? Is there more information in dmesg?
Is s_journal_inum normal after file system corruption?

Thanks,
Baokun
>                        ext4_handle_error
>                          schedule_work(&sbi->s_sb_upd_work)
>
>                                                 /* work queue kicks in */
>                                                 update_super_work
>                                                   jbd2_journal_start
>                                                     start_this_handle
>                                                       BUG_ON(journal->j_flags &
>                                                              JBD2_UNMOUNT)
>
> Hence, make sure we only defer the update of ext4 sb if the sb is still
> active.  Otherwise, just fallback to an un-journaled commit.
>
> The important thing to note here is that we must only defer sb update if
> we have not yet flushed the s_sb_update_work queue in umount path else
> this race can be hit (point 1 below). Since we don't have a direct way
> to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> subtle so adding some notes below for future reference:
>
> 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> == 0) however this is not correct since we could end up scheduling work
> after it has been flushed:
>
>   ext4_put_super
>    flush_work(&sbi->s_sb_upd_work)
>
>                             **kjournald2**
>                             jbd2_journal_commit_transaction
>                             ...
>                             ext4_inode_error
>                               /* JBD2_UNMOUNT not set */
>                               schedule_work(s_sb_upd_work)
>
>     jbd2_journal_destroy
>      journal->j_flags |= JBD2_UNMOUNT;
>
>                                        **workqueue**
>                                        update_super_work
>                                         jbd2_journal_start
>                                          start_this_handle
>                                            BUG_ON(JBD2_UNMOUNT)
>
> Something like the above doesn't happen with SB_ACTIVE check because we
> are sure that the workqueue would be flushed at a later point if we are
> in the umount path.
>
> 2. We don't need a similar check in ext4_grp_locked_error since it is
> only called from mballoc and AFAICT it would be always valid to schedule
> work here.
>
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>   fs/ext4/super.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a963ffda692a..b7341e9acf62 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>   		 * constraints, it may not be safe to do it right here so we
>   		 * defer superblock flushing to a workqueue.
>   		 */
> -		if (continue_fs && journal)
> +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
>   			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>   		else
>   			ext4_commit_super(sb);



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-25  1:53   ` Baokun Li
@ 2025-02-25 12:06     ` Ojaswin Mujoo
  2025-02-26  1:55       ` Baokun Li
  0 siblings, 1 reply; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-25 12:06 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
> On 2025/2/22 16:40, Ojaswin Mujoo wrote:
> > Presently we always BUG_ON if trying to start a transaction on a journal
> > marked with JBD2_UNMOUNT, since this should never happen. However while
> > running stress tests it was observed that in case of some error handling
> > paths, it is possible for update_super_work to start a transaction after
> > the journal is destroyed eg:
> > 
> > (umount)
> > ext4_kill_sb
> >    kill_block_super
> >      generic_shutdown_super
> >        sync_filesystem /* commits all txns */
> >        evict_inodes
> >          /* might start a new txn */
> >        ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >          jbd2_journal_destroy
> >            journal_kill_thread
> >              journal->j_flags |= JBD2_UNMOUNT;
> >            jbd2_journal_commit_transaction
> >              jbd2_journal_get_descriptor_buffer
> >                jbd2_journal_bmap
> >                  ext4_journal_bmap
> >                    ext4_map_blocks
> >                      ...
> >                      ext4_inode_error
> Just curious, since jbd2_journal_bmap() only queries the map and does not
> create it, how does it fail here? Is there more information in dmesg?
> Is s_journal_inum normal after file system corruption?

Hey Baokun, 

So I dug a bit more into the vmcore. The error information in sbi looks
like this:

  s_add_error_count = 1,
  s_first_error_code = 117,
  s_first_error_line = 475,
  s_first_error_ino = 0,
  s_first_error_block = 0,
  s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
  s_first_error_time = 1737023235,

  s_last_error_code = 117,
  s_last_error_line = 609,
  s_last_error_ino = 8,
  s_last_error_block = 783,
  s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
  s_last_error_time = 1737023236,

  The first error is here:

      if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
     474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
  *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
     476                              "block_group %u", bitmap_blk, block_group);
     477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
     478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
     479                   return ERR_PTR(-EFSCORRUPTED);
     480           }

and the last error is here:

    608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 *  609                   ret = check_block_validity(inode, map);
    610                   if (ret != 0)
    611                           return ret;
    612           }


And indeed we have the traces of the first error in dmesg:

[75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
[75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
[75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0

However, the last error seems strange. It seems like check_block_validity
should ideally never fail for a journal inode. Unfortunately, sbi->s_es page is
not recorded in the crash dump for some reason so idk the exact value at the
time of the check, but looking in journal->j_inode->i_ino, the inode num is 8,
which seems fine to me. So yeah, I'm a bit unsure what caused the corruption.
I'll look a bit more into the proc01 ltp to see if we can recreate the failure
to get more info.

> 
> Thanks,
> Baokun
> >                        ext4_handle_error
> >                          schedule_work(&sbi->s_sb_upd_work)
> > 
> >                                                 /* work queue kicks in */
> >                                                 update_super_work
> >                                                   jbd2_journal_start
> >                                                     start_this_handle
> >                                                       BUG_ON(journal->j_flags &
> >                                                              JBD2_UNMOUNT)
> > 
> > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > active.  Otherwise, just fallback to an un-journaled commit.
> > 
> > The important thing to note here is that we must only defer sb update if
> > we have not yet flushed the s_sb_update_work queue in umount path else
> > this race can be hit (point 1 below). Since we don't have a direct way
> > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > subtle so adding some notes below for future reference:
> > 
> > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > == 0) however this is not correct since we could end up scheduling work
> > after it has been flushed:
> > 
> >   ext4_put_super
> >    flush_work(&sbi->s_sb_upd_work)
> > 
> >                             **kjournald2**
> >                             jbd2_journal_commit_transaction
> >                             ...
> >                             ext4_inode_error
> >                               /* JBD2_UNMOUNT not set */
> >                               schedule_work(s_sb_upd_work)
> > 
> >     jbd2_journal_destroy
> >      journal->j_flags |= JBD2_UNMOUNT;
> > 
> >                                        **workqueue**
> >                                        update_super_work
> >                                         jbd2_journal_start
> >                                          start_this_handle
> >                                            BUG_ON(JBD2_UNMOUNT)
> > 
> > Something like the above doesn't happen with SB_ACTIVE check because we
> > are sure that the workqueue would be flushed at a later point if we are
> > in the umount path.
> > 
> > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > only called from mballoc and AFAICT it would be always valid to schedule
> > work here.
> > 
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >   fs/ext4/super.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index a963ffda692a..b7341e9acf62 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >   		 * constraints, it may not be safe to do it right here so we
> >   		 * defer superblock flushing to a workqueue.
> >   		 */
> > -		if (continue_fs && journal)
> > +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
> >   			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >   		else
> >   			ext4_commit_super(sb);
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-25 12:06     ` Ojaswin Mujoo
@ 2025-02-26  1:55       ` Baokun Li
  2025-02-27 11:43         ` Ojaswin Mujoo
  0 siblings, 1 reply; 16+ messages in thread
From: Baokun Li @ 2025-02-26  1:55 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On 2025/2/25 20:06, Ojaswin Mujoo wrote:
> On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
>> On 2025/2/22 16:40, Ojaswin Mujoo wrote:
>>> Presently we always BUG_ON if trying to start a transaction on a journal
>>> marked with JBD2_UNMOUNT, since this should never happen. However while
>>> running stress tests it was observed that in case of some error handling
>>> paths, it is possible for update_super_work to start a transaction after
>>> the journal is destroyed eg:
>>>
>>> (umount)
>>> ext4_kill_sb
>>>     kill_block_super
>>>       generic_shutdown_super
>>>         sync_filesystem /* commits all txns */
>>>         evict_inodes
>>>           /* might start a new txn */
>>>         ext4_put_super
>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>           jbd2_journal_destroy
>>>             journal_kill_thread
>>>               journal->j_flags |= JBD2_UNMOUNT;
>>>             jbd2_journal_commit_transaction
>>>               jbd2_journal_get_descriptor_buffer
>>>                 jbd2_journal_bmap
>>>                   ext4_journal_bmap
>>>                     ext4_map_blocks
>>>                       ...
>>>                       ext4_inode_error
>> Just curious, since jbd2_journal_bmap() only queries the map and does not
>> create it, how does it fail here? Is there more information in dmesg?
>> Is s_journal_inum normal after file system corruption?
> Hey Baokun,
Hello Ojaswin,

Thanks for your detailed explanation!
>
> So I dug a bit more into the vmcore. The error information in sbi looks
> like this:
>
>    s_add_error_count = 1,
>    s_first_error_code = 117,
>    s_first_error_line = 475,
>    s_first_error_ino = 0,
>    s_first_error_block = 0,
>    s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
>    s_first_error_time = 1737023235,
>
>    s_last_error_code = 117,
>    s_last_error_line = 609,
>    s_last_error_ino = 8,
>    s_last_error_block = 783,
>    s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
>    s_last_error_time = 1737023236,
>
>    The first error is here:
>
>        if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
>       474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
>    *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
>       476                              "block_group %u", bitmap_blk, block_group);
>       477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
>       478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>       479                   return ERR_PTR(-EFSCORRUPTED);
>       480           }
>
> and the last error is here:
>
>      608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>   *  609                   ret = check_block_validity(inode, map);
>      610                   if (ret != 0)
>      611                           return ret;
>      612           }
>
>
> And indeed we have the traces of the first error in dmesg:
>
> [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>
> However, the last error seems strange. It seems like check_block_validity
> should ideally never fail for a journal inode.Unfortunately, sbi->s_es page is
> not recorded in the crash dump for some reason so idk the exact value at the
> time of the check, but looking in journal->j_inode->i_ino, the inode num is 8,
> which seems fine to me. So yeah, I'm a bit unsure what caused the corruption.
> I'll look a bit more into the proc01 ltp to see if we can recreate the failure
> to get more info.
Right, check_block_validity() skips the journal inode check. If
the journal inode check fails, that shows s_es->s_journal_inum and
journal->j_inode->i_ino are different. The file system doesn't modify
s_journal_inum, so it should be modified by some other writer bypassing
the file system (i.e. writing to bare disk).

If that's how it is, we could avoid this issue by using EXT4_JOURNAL_INO
directly or saving s_journal_inum to ext4_sb_info (which offers better
compatibility).


Cheers,
Baokun
>> Thanks,
>> Baokun
>>>                         ext4_handle_error
>>>                           schedule_work(&sbi->s_sb_upd_work)
>>>
>>>                                                  /* work queue kicks in */
>>>                                                  update_super_work
>>>                                                    jbd2_journal_start
>>>                                                      start_this_handle
>>>                                                        BUG_ON(journal->j_flags &
>>>                                                               JBD2_UNMOUNT)
>>>
>>> Hence, make sure we only defer the update of ext4 sb if the sb is still
>>> active.  Otherwise, just fallback to an un-journaled commit.
>>>
>>> The important thing to note here is that we must only defer sb update if
>>> we have not yet flushed the s_sb_update_work queue in umount path else
>>> this race can be hit (point 1 below). Since we don't have a direct way
>>> to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
>>> subtle so adding some notes below for future reference:
>>>
>>> 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
>>> == 0) however this is not correct since we could end up scheduling work
>>> after it has been flushed:
>>>
>>>    ext4_put_super
>>>     flush_work(&sbi->s_sb_upd_work)
>>>
>>>                              **kjournald2**
>>>                              jbd2_journal_commit_transaction
>>>                              ...
>>>                              ext4_inode_error
>>>                                /* JBD2_UNMOUNT not set */
>>>                                schedule_work(s_sb_upd_work)
>>>
>>>      jbd2_journal_destroy
>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                                         **workqueue**
>>>                                         update_super_work
>>>                                          jbd2_journal_start
>>>                                           start_this_handle
>>>                                             BUG_ON(JBD2_UNMOUNT)
>>>
>>> Something like the above doesn't happen with SB_ACTIVE check because we
>>> are sure that the workqueue would be flushed at a later point if we are
>>> in the umount path.
>>>
>>> 2. We don't need a similar check in ext4_grp_locked_error since it is
>>> only called from mballoc and AFAICT it would be always valid to schedule
>>> work here.
>>>
>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>> Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>> ---
>>>    fs/ext4/super.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index a963ffda692a..b7341e9acf62 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>    		 * constraints, it may not be safe to do it right here so we
>>>    		 * defer superblock flushing to a workqueue.
>>>    		 */
>>> -		if (continue_fs && journal)
>>> +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
>>>    			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>    		else
>>>    			ext4_commit_super(sb);
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-24 14:52   ` Jan Kara
@ 2025-02-27 11:20     ` Ojaswin Mujoo
  2025-03-04 10:25       ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-27 11:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, linux-kernel, Mahesh Kumar,
	Ritesh Harjani

On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > Presently we always BUG_ON if trying to start a transaction on a journal
> > marked with JBD2_UNMOUNT, since this should never happen. However while
> > running stress tests it was observed that in case of some error handling
> > paths, it is possible for update_super_work to start a transaction after
> > the journal is destroyed eg:
> > 
> > (umount)
> > ext4_kill_sb
> >   kill_block_super
> >     generic_shutdown_super
> >       sync_filesystem /* commits all txns */
> >       evict_inodes
> >         /* might start a new txn */
> >       ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >         jbd2_journal_destroy
> >           journal_kill_thread
> >             journal->j_flags |= JBD2_UNMOUNT;
> >           jbd2_journal_commit_transaction
> >             jbd2_journal_get_descriptor_buffer
> >               jbd2_journal_bmap
> >                 ext4_journal_bmap
> >                   ext4_map_blocks
> >                     ...
> >                     ext4_inode_error
> >                       ext4_handle_error
> >                         schedule_work(&sbi->s_sb_upd_work)
> > 
> >                                                /* work queue kicks in */
> >                                                update_super_work
> >                                                  jbd2_journal_start
> >                                                    start_this_handle
> >                                                      BUG_ON(journal->j_flags &
> >                                                             JBD2_UNMOUNT)
> > 
> > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > active.  Otherwise, just fallback to an un-journaled commit.
> > 
> > The important thing to note here is that we must only defer sb update if
> > we have not yet flushed the s_sb_update_work queue in umount path else
> > this race can be hit (point 1 below). Since we don't have a direct way
> > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > subtle so adding some notes below for future reference:
> > 
> > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > == 0) however this is not correct since we could end up scheduling work
> > after it has been flushed:
> > 
> >  ext4_put_super
> >   flush_work(&sbi->s_sb_upd_work)
> > 
> >                            **kjournald2**
> >                            jbd2_journal_commit_transaction
> >                            ...
> >                            ext4_inode_error
> >                              /* JBD2_UNMOUNT not set */
> >                              schedule_work(s_sb_upd_work)
> > 
> >    jbd2_journal_destroy
> >     journal->j_flags |= JBD2_UNMOUNT;
> > 
> >                                       **workqueue**
> >                                       update_super_work
> >                                        jbd2_journal_start
> >                                         start_this_handle
> >                                           BUG_ON(JBD2_UNMOUNT)
> > 
> > Something like the above doesn't happen with SB_ACTIVE check because we
> > are sure that the workqueue would be flushed at a later point if we are
> > in the umount path.
> > 
> > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > only called from mballoc and AFAICT it would be always valid to schedule
> > work here.
> > 
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Good catch! But I think the solution will have to be slightly different.
> Basing the check on SB_ACTIVE has the problem that you can have racing
> updates of the sb in the still running transaction and in your direct
> update leading to inconsistencies after a crash (that was the reason why
> we've created the s_sb_upd_work in the first place).
> 
> I would solve this by implementing something like
> ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> workqueue, and then destroy the journal. And ext4_handle_error() will check
> for the sbi flag.
> 
> 								Honza

Hey Jan,

Thanks for the review. So earlier I did go through different code paths to see
if we will have a direct sb write clash with a journalled one it wouldn't but,
relooking at it, seems like we might have a scenario as follows:

generic_super_shutdown
 sync_filesytems
  /* running txns committed. executing ext4_journal_commit_callback */
  ext4_maybe_update_superblock
   /* schedules work */
   schedule_work(&sbi->s_sb_upd_work)
                                          update_super_work
                                          /* start a txn and add sb to it */
 sb->s_flags &= ~SB_ACTIVE;
 evict_inode
   ext4_evict_inode
    ext4_std_error
     ext4_handle_error
      /* direct commit of sb (Not good!) */


Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
enough to handle this as well. For example, if we add a flag like
sbi->s_journal_destroying, then:

ext4_put_super
 sbi->s_journal_destroying = true
 flush_workqueue()
  /* sb is now journalled */
 jbd2_journal_destory
  jbd2_journal_commit_transaction
   /* add tag for sb in descriptor and add buffer to wbufs[] */
   /* Later from some other buffer in the txn: */
   jbd2_journal_next_log_block
    /* hits error in ext4_journal_bmap */
    ext4_handle_error
      sbi->s_journal_destroying == true
      /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
   jbd2_journal_abort
   /* after abort everything in wbufs[] is written to journal */

In the above we will have a checksum mismatch but then maybe its not really
an issue. Maybe since we never commit the txn it is understood that the contents
can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
and the actual super block contents? In which case the sbi flag approach should
be fine.

Does my understanding sound correct?


Regards,
ojaswin
> 
> > ---
> >  fs/ext4/super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index a963ffda692a..b7341e9acf62 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> >  		 * constraints, it may not be safe to do it right here so we
> >  		 * defer superblock flushing to a workqueue.
> >  		 */
> > -		if (continue_fs && journal)
> > +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
> >  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> >  		else
> >  			ext4_commit_super(sb);
> > -- 
> > 2.48.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-26  1:55       ` Baokun Li
@ 2025-02-27 11:43         ` Ojaswin Mujoo
  2025-02-27 12:51           ` Baokun Li
  0 siblings, 1 reply; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-02-27 11:43 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
> On 2025/2/25 20:06, Ojaswin Mujoo wrote:
> > On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
> > > On 2025/2/22 16:40, Ojaswin Mujoo wrote:
> > > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > > running stress tests it was observed that in case of some error handling
> > > > paths, it is possible for update_super_work to start a transaction after
> > > > the journal is destroyed eg:
> > > > 
> > > > (umount)
> > > > ext4_kill_sb
> > > >     kill_block_super
> > > >       generic_shutdown_super
> > > >         sync_filesystem /* commits all txns */
> > > >         evict_inodes
> > > >           /* might start a new txn */
> > > >         ext4_put_super
> > > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > >           jbd2_journal_destroy
> > > >             journal_kill_thread
> > > >               journal->j_flags |= JBD2_UNMOUNT;
> > > >             jbd2_journal_commit_transaction
> > > >               jbd2_journal_get_descriptor_buffer
> > > >                 jbd2_journal_bmap
> > > >                   ext4_journal_bmap
> > > >                     ext4_map_blocks
> > > >                       ...
> > > >                       ext4_inode_error
> > > Just curious, since jbd2_journal_bmap() only queries the map and does not
> > > create it, how does it fail here? Is there more information in dmesg?
> > > Is s_journal_inum normal after file system corruption?
> > Hey Baokun,
> Hello Ojaswin,
> 
> Thanks for your detailed explanation!
> > 
> > So I dug a bit more into the vmcore. The error information in sbi looks
> > like this:
> > 
> >    s_add_error_count = 1,
> >    s_first_error_code = 117,
> >    s_first_error_line = 475,
> >    s_first_error_ino = 0,
> >    s_first_error_block = 0,
> >    s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
> >    s_first_error_time = 1737023235,
> > 
> >    s_last_error_code = 117,
> >    s_last_error_line = 609,
> >    s_last_error_ino = 8,
> >    s_last_error_block = 783,
> >    s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
> >    s_last_error_time = 1737023236,
> > 
> >    The first error is here:
> > 
> >        if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> >       474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
> >    *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
> >       476                              "block_group %u", bitmap_blk, block_group);
> >       477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
> >       478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
> >       479                   return ERR_PTR(-EFSCORRUPTED);
> >       480           }
> > 
> > and the last error is here:
> > 
> >      608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> >   *  609                   ret = check_block_validity(inode, map);
> >      610                   if (ret != 0)
> >      611                           return ret;
> >      612           }
> > 
> > 
> > And indeed we have the traces of the first error in dmesg:
> > 
> > [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > 
> > However, the last error seems strange. It seems like check_block_validity
> > should ideally never fail for a journal inode.Unfortunately, sbi->s_es page is
> > not recorded in the crash dump for some reason so idk the exact value at the
> > time of the check, but looking in journal->j_inode->i_ino, the inode num is 8,
> > which seems fine to me. So yeah, I'm a bit unsure what caused the corruption.
> > I'll look a bit more into the proc01 ltp to see if we can recreate the failure
> > to get more info.
> Right, check_block_validity() skips the journal inode check. If
> the journal inode check fails, that shows s_es->s_journal_inum and
> journal->j_inode->i_ino are different. The file system doesn't modify
> s_journal_inum, so it should be modified by some other writer bypassing
> the file system (i.e. writing to bare disk).
> 
> If that's how it is, we could avoid this issue by using EXT4_JOURNAL_INO
> directly or saving s_journal_inum to ext4_sb_info (which offers better
> compatibility).

Hey Baokun,

So I'm wondering if modifying the check in __check_block_validity the
right thing to do. The fact that something has changed the on disk
journal inode number is itself concerning and maybe stopping IO here is
the right thing to do instead of using a cached journal inode number and
silently ignoring the issue.

Anyways now that the journal inode is corrupted on disk, will the
recovery tooling even be able to read the journal anymore? 

Regards,
ojaswin

> 
> 
> Cheers,
> Baokun

> > > Thanks,
> > > Baokun
> > > >                         ext4_handle_error
> > > >                           schedule_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                                                  /* work queue kicks in */
> > > >                                                  update_super_work
> > > >                                                    jbd2_journal_start
> > > >                                                      start_this_handle
> > > >                                                        BUG_ON(journal->j_flags &
> > > >                                                               JBD2_UNMOUNT)
> > > > 
> > > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > > active.  Otherwise, just fallback to an un-journaled commit.
> > > > 
> > > > The important thing to note here is that we must only defer sb update if
> > > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > > this race can be hit (point 1 below). Since we don't have a direct way
> > > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > > subtle so adding some notes below for future reference:
> > > > 
> > > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > > == 0) however this is not correct since we could end up scheduling work
> > > > after it has been flushed:
> > > > 
> > > >    ext4_put_super
> > > >     flush_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                              **kjournald2**
> > > >                              jbd2_journal_commit_transaction
> > > >                              ...
> > > >                              ext4_inode_error
> > > >                                /* JBD2_UNMOUNT not set */
> > > >                                schedule_work(s_sb_upd_work)
> > > > 
> > > >      jbd2_journal_destroy
> > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > > 
> > > >                                         **workqueue**
> > > >                                         update_super_work
> > > >                                          jbd2_journal_start
> > > >                                           start_this_handle
> > > >                                             BUG_ON(JBD2_UNMOUNT)
> > > > 
> > > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > > are sure that the workqueue would be flushed at a later point if we are
> > > > in the umount path.
> > > > 
> > > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > > only called from mballoc and AFAICT it would be always valid to schedule
> > > > work here.
> > > > 
> > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > ---
> > > >    fs/ext4/super.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > > index a963ffda692a..b7341e9acf62 100644
> > > > --- a/fs/ext4/super.c
> > > > +++ b/fs/ext4/super.c
> > > > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> > > >    		 * constraints, it may not be safe to do it right here so we
> > > >    		 * defer superblock flushing to a workqueue.
> > > >    		 */
> > > > -		if (continue_fs && journal)
> > > > +		if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
> > > >    			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> > > >    		else
> > > >    			ext4_commit_super(sb);
> > > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-27 11:43         ` Ojaswin Mujoo
@ 2025-02-27 12:51           ` Baokun Li
  2025-03-03 17:30             ` Ojaswin Mujoo
  0 siblings, 1 reply; 16+ messages in thread
From: Baokun Li @ 2025-02-27 12:51 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On 2025/2/27 19:43, Ojaswin Mujoo wrote:
> On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
>> On 2025/2/25 20:06, Ojaswin Mujoo wrote:
>>> On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
>>>> On 2025/2/22 16:40, Ojaswin Mujoo wrote:
>>>>> Presently we always BUG_ON if trying to start a transaction on a journal
>>>>> marked with JBD2_UNMOUNT, since this should never happen. However while
>>>>> running stress tests it was observed that in case of some error handling
>>>>> paths, it is possible for update_super_work to start a transaction after
>>>>> the journal is destroyed eg:
>>>>>
>>>>> (umount)
>>>>> ext4_kill_sb
>>>>>      kill_block_super
>>>>>        generic_shutdown_super
>>>>>          sync_filesystem /* commits all txns */
>>>>>          evict_inodes
>>>>>            /* might start a new txn */
>>>>>          ext4_put_super
>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>            jbd2_journal_destroy
>>>>>              journal_kill_thread
>>>>>                journal->j_flags |= JBD2_UNMOUNT;
>>>>>              jbd2_journal_commit_transaction
>>>>>                jbd2_journal_get_descriptor_buffer
>>>>>                  jbd2_journal_bmap
>>>>>                    ext4_journal_bmap
>>>>>                      ext4_map_blocks
>>>>>                        ...
>>>>>                        ext4_inode_error
>>>> Just curious, since jbd2_journal_bmap() only queries the map and does not
>>>> create it, how does it fail here? Is there more information in dmesg?
>>>> Is s_journal_inum normal after file system corruption?
>>> Hey Baokun,
>> Hello Ojaswin,
>>
>> Thanks for your detailed explanation!
>>> So I dug a bit more into the vmcore. The error information in sbi looks
>>> like this:
>>>
>>>     s_add_error_count = 1,
>>>     s_first_error_code = 117,
>>>     s_first_error_line = 475,
>>>     s_first_error_ino = 0,
>>>     s_first_error_block = 0,
>>>     s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
>>>     s_first_error_time = 1737023235,
>>>
>>>     s_last_error_code = 117,
>>>     s_last_error_line = 609,
>>>     s_last_error_ino = 8,
>>>     s_last_error_block = 783,
>>>     s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
>>>     s_last_error_time = 1737023236,
>>>
>>>     The first error is here:
>>>
>>>         if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
>>>        474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
>>>     *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
>>>        476                              "block_group %u", bitmap_blk, block_group);
>>>        477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
>>>        478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>>>        479                   return ERR_PTR(-EFSCORRUPTED);
>>>        480           }
>>>
>>> and the last error is here:
>>>
>>>       608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>>>    *  609                   ret = check_block_validity(inode, map);
>>>       610                   if (ret != 0)
>>>       611                           return ret;
>>>       612           }
>>>
>>>
>>> And indeed we have the traces of the first error in dmesg:
>>>
>>> [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>> [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>> [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>>
>>> However, the last error seems strange. It seems like check_block_validity
>>> should ideally never fail for a journal inode.Unfortunately, sbi->s_es page is
>>> not recorded in the crash dump for some reason so idk the exact value at the
>>> time of the check, but looking in journal->j_inode->i_ino, the inode num is 8,
>>> which seems fine to me. So yeah, I'm a bit unsure what caused the corruption.
>>> I'll look a bit more into the proc01 ltp to see if we can recreate the failure
>>> to get more info.
>> Right, check_block_validity() skips the journal inode check. If
>> the journal inode check fails, that shows s_es->s_journal_inum and
>> journal->j_inode->i_ino are different. The file system doesn't modify
>> s_journal_inum, so it should be modified by some other writer bypassing
>> the file system (i.e. writing to bare disk).
>>
>> If that's how it is, we could avoid this issue by using EXT4_JOURNAL_INO
>> directly or saving s_journal_inum to ext4_sb_info (which offers better
>> compatibility).
> Hey Baokun,
>
> So I'm wondering if modifying the check in __check_block_validity the
> right thing to do. The fact that something has changed the on disk
> journal inode number is itself concerning and maybe stopping IO here is
> the right thing to do instead of using a cached journal inode number and
> silently ignoring the issue.
Because the cached journal inode is fine (it was checked when mounting),
and the data we're committing is good too, I think it's okay to keep
committing.  The actual problem is someone modified the superblock
directly, bypassing the file system, and the file system can't protect
against that.
>
> Anyways now that the journal inode is corrupted on disk, will the
> recovery tooling even be able to read the journal anymore?
Actually, the journal inode on disk could still be okay. This just tells us
s_es->s_journal_inum is abnormal, meaning part of the superblock on disk
got changed. If only the superblock was modified, we can use the backup
superblock or grab the latest superblock copy from the journal.


Cheers,
Baokun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-27 12:51           ` Baokun Li
@ 2025-03-03 17:30             ` Ojaswin Mujoo
  0 siblings, 0 replies; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-03-03 17:30 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, linux-kernel,
	Mahesh Kumar, Ritesh Harjani, Yang Erkun

On Thu, Feb 27, 2025 at 08:51:11PM +0800, Baokun Li wrote:
> On 2025/2/27 19:43, Ojaswin Mujoo wrote:
> > On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
> > > On 2025/2/25 20:06, Ojaswin Mujoo wrote:
> > > > On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
> > > > > On 2025/2/22 16:40, Ojaswin Mujoo wrote:
> > > > > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > > > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > > > > running stress tests it was observed that in case of some error handling
> > > > > > paths, it is possible for update_super_work to start a transaction after
> > > > > > the journal is destroyed eg:
> > > > > > 
> > > > > > (umount)
> > > > > > ext4_kill_sb
> > > > > >      kill_block_super
> > > > > >        generic_shutdown_super
> > > > > >          sync_filesystem /* commits all txns */
> > > > > >          evict_inodes
> > > > > >            /* might start a new txn */
> > > > > >          ext4_put_super
> > > > > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > > >            jbd2_journal_destroy
> > > > > >              journal_kill_thread
> > > > > >                journal->j_flags |= JBD2_UNMOUNT;
> > > > > >              jbd2_journal_commit_transaction
> > > > > >                jbd2_journal_get_descriptor_buffer
> > > > > >                  jbd2_journal_bmap
> > > > > >                    ext4_journal_bmap
> > > > > >                      ext4_map_blocks
> > > > > >                        ...
> > > > > >                        ext4_inode_error
> > > > > Just curious, since jbd2_journal_bmap() only queries the map and does not
> > > > > create it, how does it fail here? Is there more information in dmesg?
> > > > > Is s_journal_inum normal after file system corruption?
> > > > Hey Baokun,
> > > Hello Ojaswin,
> > > 
> > > Thanks for your detailed explanation!
> > > > So I dug a bit more into the vmcore. The error information in sbi looks
> > > > like this:
> > > > 
> > > >     s_add_error_count = 1,
> > > >     s_first_error_code = 117,
> > > >     s_first_error_line = 475,
> > > >     s_first_error_ino = 0,
> > > >     s_first_error_block = 0,
> > > >     s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
> > > >     s_first_error_time = 1737023235,
> > > > 
> > > >     s_last_error_code = 117,
> > > >     s_last_error_line = 609,
> > > >     s_last_error_ino = 8,
> > > >     s_last_error_block = 783,
> > > >     s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
> > > >     s_last_error_time = 1737023236,
> > > > 
> > > >     The first error is here:
> > > > 
> > > >         if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> > > >        474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
> > > >     *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
> > > >        476                              "block_group %u", bitmap_blk, block_group);
> > > >        477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
> > > >        478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
> > > >        479                   return ERR_PTR(-EFSCORRUPTED);
> > > >        480           }
> > > > 
> > > > and the last error is here:
> > > > 
> > > >       608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > > >    *  609                   ret = check_block_validity(inode, map);
> > > >       610                   if (ret != 0)
> > > >       611                           return ret;
> > > >       612           }
> > > > 
> > > > 
> > > > And indeed we have the traces of the first error in dmesg:
> > > > 
> > > > [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > > > [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > > > [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > > > 
> > > > However, the last error seems strange. It seems like check_block_validity
> > > > should ideally never fail for a journal inode.Unfortunately, sbi->s_es page is
> > > > not recorded in the crash dump for some reason so idk the exact value at the
> > > > time of the check, but looking in journal->j_inode->i_ino, the inode num is 8,
> > > > which seems fine to me. So yeah, I'm a bit unsure what caused the corruption.
> > > > I'll look a bit more into the proc01 ltp to see if we can recreate the failure
> > > > to get more info.
> > > Right, check_block_validity() skips the journal inode check. If
> > > the journal inode check fails, that shows s_es->s_journal_inum and
> > > journal->j_inode->i_ino are different. The file system doesn't modify
> > > s_journal_inum, so it should be modified by some other writer bypassing
> > > the file system (i.e. writing to bare disk).
> > > 
> > > If that's how it is, we could avoid this issue by using EXT4_JOURNAL_INO
> > > directly or saving s_journal_inum to ext4_sb_info (which offers better
> > > compatibility).
> > Hey Baokun,
> > 
> > So I'm wondering if modifying the check in __check_block_validity the
> > right thing to do. The fact that something has changed the on disk
> > journal inode number is itself concerning and maybe stopping IO here is
> > the right thing to do instead of using a cached journal inode number and
> > silently ignoring the issue.
> Because the cached journal inode is fine (it was checked when mounting),
> and the data we're committing is good too, I think it's okay to keep
> committing.  The actual problem is someone modified the superblock
> directly, bypassing the file system, and the file system can't protect
> against that.
> > 
> > Anyways now that the journal inode is corrupted on disk, will the
> > recovery tooling even be able to read the journal anymore?
> Actually, the journal inode on disk could still be okay. This just tells us
> s_es->s_journal_inum is abnormal, meaning part of the superblock on disk
> got changed. If only the superblock was modified, we can use the backup
> superblock or grab the latest superblock copy from the journal.

Ahh right thats true idk why I missed that. Although it's very unlikely 
someone would have modified/corrupted the journal inode num in sb in the
tests we are running (mostly syscall tests in ltp), but yes your
suggestion does make sense. I'll try to look into it and add it in the
next revision.

Regards,
ojaswin

> 
> 
> Cheers,
> Baokun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-02-27 11:20     ` Ojaswin Mujoo
@ 2025-03-04 10:25       ` Jan Kara
  2025-03-04 12:54         ` Ojaswin Mujoo
  2025-03-05  4:14         ` Ritesh Harjani
  0 siblings, 2 replies; 16+ messages in thread
From: Jan Kara @ 2025-03-04 10:25 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, linux-ext4, Theodore Ts'o, linux-kernel,
	Mahesh Kumar, Ritesh Harjani

On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
> On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > running stress tests it was observed that in case of some error handling
> > > paths, it is possible for update_super_work to start a transaction after
> > > the journal is destroyed eg:
> > > 
> > > (umount)
> > > ext4_kill_sb
> > >   kill_block_super
> > >     generic_shutdown_super
> > >       sync_filesystem /* commits all txns */
> > >       evict_inodes
> > >         /* might start a new txn */
> > >       ext4_put_super
> > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >         jbd2_journal_destroy
> > >           journal_kill_thread
> > >             journal->j_flags |= JBD2_UNMOUNT;
> > >           jbd2_journal_commit_transaction
> > >             jbd2_journal_get_descriptor_buffer
> > >               jbd2_journal_bmap
> > >                 ext4_journal_bmap
> > >                   ext4_map_blocks
> > >                     ...
> > >                     ext4_inode_error
> > >                       ext4_handle_error
> > >                         schedule_work(&sbi->s_sb_upd_work)
> > > 
> > >                                                /* work queue kicks in */
> > >                                                update_super_work
> > >                                                  jbd2_journal_start
> > >                                                    start_this_handle
> > >                                                      BUG_ON(journal->j_flags &
> > >                                                             JBD2_UNMOUNT)
> > > 
> > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > active.  Otherwise, just fallback to an un-journaled commit.
> > > 
> > > The important thing to note here is that we must only defer sb update if
> > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > this race can be hit (point 1 below). Since we don't have a direct way
> > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > subtle so adding some notes below for future reference:
> > > 
> > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > == 0) however this is not correct since we could end up scheduling work
> > > after it has been flushed:
> > > 
> > >  ext4_put_super
> > >   flush_work(&sbi->s_sb_upd_work)
> > > 
> > >                            **kjournald2**
> > >                            jbd2_journal_commit_transaction
> > >                            ...
> > >                            ext4_inode_error
> > >                              /* JBD2_UNMOUNT not set */
> > >                              schedule_work(s_sb_upd_work)
> > > 
> > >    jbd2_journal_destroy
> > >     journal->j_flags |= JBD2_UNMOUNT;
> > > 
> > >                                       **workqueue**
> > >                                       update_super_work
> > >                                        jbd2_journal_start
> > >                                         start_this_handle
> > >                                           BUG_ON(JBD2_UNMOUNT)
> > > 
> > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > are sure that the workqueue would be flushed at a later point if we are
> > > in the umount path.
> > > 
> > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > only called from mballoc and AFAICT it would be always valid to schedule
> > > work here.
> > > 
> > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > 
> > Good catch! But I think the solution will have to be slightly different.
> > Basing the check on SB_ACTIVE has the problem that you can have racing
> > updates of the sb in the still running transaction and in your direct
> > update leading to inconsistencies after a crash (that was the reason why
> > we've created the s_sb_upd_work in the first place).
> > 
> > I would solve this by implementing something like
> > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> > workqueue, and then destroy the journal. And ext4_handle_error() will check
> > for the sbi flag.
> > 
> > 								Honza
> 
> Hey Jan,
> 
> Thanks for the review. So earlier I did go through different code paths to see
> if we will have a direct sb write clash with a journalled one it wouldn't but,
> relooking at it, seems like we might have a scenario as follows:
> 
> generic_super_shutdown
>  sync_filesytems
>   /* running txns committed. executing ext4_journal_commit_callback */
>   ext4_maybe_update_superblock
>    /* schedules work */
>    schedule_work(&sbi->s_sb_upd_work)
>                                           update_super_work
>                                           /* start a txn and add sb to it */
>  sb->s_flags &= ~SB_ACTIVE;
>  evict_inode
>    ext4_evict_inode
>     ext4_std_error
>      ext4_handle_error
>       /* direct commit of sb (Not good!) */
> 
> 
> Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
> enough to handle this as well. For example, if we add a flag like
> sbi->s_journal_destroying, then:
> 
> ext4_put_super
>  sbi->s_journal_destroying = true
>  flush_workqueue()
>   /* sb is now journalled */
>  jbd2_journal_destory
>   jbd2_journal_commit_transaction
>    /* add tag for sb in descriptor and add buffer to wbufs[] */
>    /* Later from some other buffer in the txn: */
>    jbd2_journal_next_log_block
>     /* hits error in ext4_journal_bmap */
>     ext4_handle_error
>       sbi->s_journal_destroying == true
>       /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
>    jbd2_journal_abort
>    /* after abort everything in wbufs[] is written to journal */
> 
> In the above we will have a checksum mismatch but then maybe its not really
> an issue. Maybe since we never commit the txn it is understood that the contents
> can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
> and the actual super block contents? In which case the sbi flag approach should
> be fine.
> 
> Does my understanding sound correct?

Yes. Since the transaction does not get committed, its contents will be
(and must be) ignored. So although you are correct that the superblock
content in the transaction need to match the content we write directly, it
does not matter because whatever is in the uncommitted transaction must
never be written to the final position on disk.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-03-04 10:25       ` Jan Kara
@ 2025-03-04 12:54         ` Ojaswin Mujoo
  2025-03-05  4:14         ` Ritesh Harjani
  1 sibling, 0 replies; 16+ messages in thread
From: Ojaswin Mujoo @ 2025-03-04 12:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, linux-kernel, Mahesh Kumar,
	Ritesh Harjani

On Tue, Mar 04, 2025 at 11:25:02AM +0100, Jan Kara wrote:
> On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
> > On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> > > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > > running stress tests it was observed that in case of some error handling
> > > > paths, it is possible for update_super_work to start a transaction after
> > > > the journal is destroyed eg:
> > > > 
> > > > (umount)
> > > > ext4_kill_sb
> > > >   kill_block_super
> > > >     generic_shutdown_super
> > > >       sync_filesystem /* commits all txns */
> > > >       evict_inodes
> > > >         /* might start a new txn */
> > > >       ext4_put_super
> > > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > >         jbd2_journal_destroy
> > > >           journal_kill_thread
> > > >             journal->j_flags |= JBD2_UNMOUNT;
> > > >           jbd2_journal_commit_transaction
> > > >             jbd2_journal_get_descriptor_buffer
> > > >               jbd2_journal_bmap
> > > >                 ext4_journal_bmap
> > > >                   ext4_map_blocks
> > > >                     ...
> > > >                     ext4_inode_error
> > > >                       ext4_handle_error
> > > >                         schedule_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                                                /* work queue kicks in */
> > > >                                                update_super_work
> > > >                                                  jbd2_journal_start
> > > >                                                    start_this_handle
> > > >                                                      BUG_ON(journal->j_flags &
> > > >                                                             JBD2_UNMOUNT)
> > > > 
> > > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > > active.  Otherwise, just fallback to an un-journaled commit.
> > > > 
> > > > The important thing to note here is that we must only defer sb update if
> > > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > > this race can be hit (point 1 below). Since we don't have a direct way
> > > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > > subtle so adding some notes below for future reference:
> > > > 
> > > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > > == 0) however this is not correct since we could end up scheduling work
> > > > after it has been flushed:
> > > > 
> > > >  ext4_put_super
> > > >   flush_work(&sbi->s_sb_upd_work)
> > > > 
> > > >                            **kjournald2**
> > > >                            jbd2_journal_commit_transaction
> > > >                            ...
> > > >                            ext4_inode_error
> > > >                              /* JBD2_UNMOUNT not set */
> > > >                              schedule_work(s_sb_upd_work)
> > > > 
> > > >    jbd2_journal_destroy
> > > >     journal->j_flags |= JBD2_UNMOUNT;
> > > > 
> > > >                                       **workqueue**
> > > >                                       update_super_work
> > > >                                        jbd2_journal_start
> > > >                                         start_this_handle
> > > >                                           BUG_ON(JBD2_UNMOUNT)
> > > > 
> > > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > > are sure that the workqueue would be flushed at a later point if we are
> > > > in the umount path.
> > > > 
> > > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > > only called from mballoc and AFAICT it would be always valid to schedule
> > > > work here.
> > > > 
> > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > > > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > 
> > > Good catch! But I think the solution will have to be slightly different.
> > > Basing the check on SB_ACTIVE has the problem that you can have racing
> > > updates of the sb in the still running transaction and in your direct
> > > update leading to inconsistencies after a crash (that was the reason why
> > > we've created the s_sb_upd_work in the first place).
> > > 
> > > I would solve this by implementing something like
> > > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> > > workqueue, and then destroy the journal. And ext4_handle_error() will check
> > > for the sbi flag.
> > > 
> > > 								Honza
> > 
> > Hey Jan,
> > 
> > Thanks for the review. So earlier I did go through different code paths to see
> > if we will have a direct sb write clash with a journalled one it wouldn't but,
> > relooking at it, seems like we might have a scenario as follows:
> > 
> > generic_super_shutdown
> >  sync_filesytems
> >   /* running txns committed. executing ext4_journal_commit_callback */
> >   ext4_maybe_update_superblock
> >    /* schedules work */
> >    schedule_work(&sbi->s_sb_upd_work)
> >                                           update_super_work
> >                                           /* start a txn and add sb to it */
> >  sb->s_flags &= ~SB_ACTIVE;
> >  evict_inode
> >    ext4_evict_inode
> >     ext4_std_error
> >      ext4_handle_error
> >       /* direct commit of sb (Not good!) */
> > 
> > 
> > Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
> > enough to handle this as well. For example, if we add a flag like
> > sbi->s_journal_destroying, then:
> > 
> > ext4_put_super
> >  sbi->s_journal_destroying = true
> >  flush_workqueue()
> >   /* sb is now journalled */
> >  jbd2_journal_destory
> >   jbd2_journal_commit_transaction
> >    /* add tag for sb in descriptor and add buffer to wbufs[] */
> >    /* Later from some other buffer in the txn: */
> >    jbd2_journal_next_log_block
> >     /* hits error in ext4_journal_bmap */
> >     ext4_handle_error
> >       sbi->s_journal_destroying == true
> >       /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
> >    jbd2_journal_abort
> >    /* after abort everything in wbufs[] is written to journal */
> > 
> > In the above we will have a checksum mismatch but then maybe its not really
> > an issue. Maybe since we never commit the txn it is understood that the contents
> > can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
> > and the actual super block contents? In which case the sbi flag approach should
> > be fine.
> > 
> > Does my understanding sound correct?
> 
> Yes. Since the transaction does not get committed, its contents will be
> (and must be) ignored. So although you are correct that the superblock
> content in the transaction need to match the content we write directly, it
> does not matter because whatever is in the uncommitted transaction must
> never be written to the final position on disk.
> 
> 								Honza

Got it, thanks for the review. I'll address this in v2

Regards,
ojaswin
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
  2025-03-04 10:25       ` Jan Kara
  2025-03-04 12:54         ` Ojaswin Mujoo
@ 2025-03-05  4:14         ` Ritesh Harjani
  1 sibling, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2025-03-05  4:14 UTC (permalink / raw)
  To: Jan Kara, Ojaswin Mujoo
  Cc: Jan Kara, linux-ext4, Theodore Ts'o, linux-kernel,
	Mahesh Kumar

Jan Kara <jack@suse.cz> writes:

> On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
>> On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
>> > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
>> > > Presently we always BUG_ON if trying to start a transaction on a journal
>> > > marked with JBD2_UNMOUNT, since this should never happen. However while
>> > > running stress tests it was observed that in case of some error handling
>> > > paths, it is possible for update_super_work to start a transaction after
>> > > the journal is destroyed eg:
>> > > 
>> > > (umount)
>> > > ext4_kill_sb
>> > >   kill_block_super
>> > >     generic_shutdown_super
>> > >       sync_filesystem /* commits all txns */
>> > >       evict_inodes
>> > >         /* might start a new txn */
>> > >       ext4_put_super
>> > > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>> > >         jbd2_journal_destroy
>> > >           journal_kill_thread
>> > >             journal->j_flags |= JBD2_UNMOUNT;
>> > >           jbd2_journal_commit_transaction
>> > >             jbd2_journal_get_descriptor_buffer
>> > >               jbd2_journal_bmap
>> > >                 ext4_journal_bmap
>> > >                   ext4_map_blocks
>> > >                     ...
>> > >                     ext4_inode_error
>> > >                       ext4_handle_error
>> > >                         schedule_work(&sbi->s_sb_upd_work)
>> > > 
>> > >                                                /* work queue kicks in */
>> > >                                                update_super_work
>> > >                                                  jbd2_journal_start
>> > >                                                    start_this_handle
>> > >                                                      BUG_ON(journal->j_flags &
>> > >                                                             JBD2_UNMOUNT)
>> > > 
>> > > Hence, make sure we only defer the update of ext4 sb if the sb is still
>> > > active.  Otherwise, just fallback to an un-journaled commit.
>> > > 
>> > > The important thing to note here is that we must only defer sb update if
>> > > we have not yet flushed the s_sb_update_work queue in umount path else
>> > > this race can be hit (point 1 below). Since we don't have a direct way
>> > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
>> > > subtle so adding some notes below for future reference:
>> > > 
>> > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
>> > > == 0) however this is not correct since we could end up scheduling work
>> > > after it has been flushed:
>> > > 
>> > >  ext4_put_super
>> > >   flush_work(&sbi->s_sb_upd_work)
>> > > 
>> > >                            **kjournald2**
>> > >                            jbd2_journal_commit_transaction
>> > >                            ...
>> > >                            ext4_inode_error
>> > >                              /* JBD2_UNMOUNT not set */
>> > >                              schedule_work(s_sb_upd_work)
>> > > 
>> > >    jbd2_journal_destroy
>> > >     journal->j_flags |= JBD2_UNMOUNT;
>> > > 
>> > >                                       **workqueue**
>> > >                                       update_super_work
>> > >                                        jbd2_journal_start
>> > >                                         start_this_handle
>> > >                                           BUG_ON(JBD2_UNMOUNT)
>> > > 
>> > > Something like the above doesn't happen with SB_ACTIVE check because we
>> > > are sure that the workqueue would be flushed at a later point if we are
>> > > in the umount path.
>> > > 
>> > > 2. We don't need a similar check in ext4_grp_locked_error since it is
>> > > only called from mballoc and AFAICT it would be always valid to schedule
>> > > work here.
>> > > 
>> > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>> > > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>> > > Suggested-by: Ritesh Harjani <ritesh.list@gmail.com>
>> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> > 
>> > Good catch! But I think the solution will have to be slightly different.
>> > Basing the check on SB_ACTIVE has the problem that you can have racing
>> > updates of the sb in the still running transaction and in your direct
>> > update leading to inconsistencies after a crash (that was the reason why
>> > we've created the s_sb_upd_work in the first place).
>> > 
>> > I would solve this by implementing something like
>> > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
>> > workqueue, and then destroy the journal. And ext4_handle_error() will check
>> > for the sbi flag.
>> > 
>> > 								Honza
>> 
>> Hey Jan,
>> 
>> Thanks for the review. So earlier I did go through different code paths to see
>> if we will have a direct sb write clash with a journalled one it wouldn't but,
>> relooking at it, seems like we might have a scenario as follows:
>> 
>> generic_super_shutdown
>>  sync_filesytems
>>   /* running txns committed. executing ext4_journal_commit_callback */
>>   ext4_maybe_update_superblock
>>    /* schedules work */
>>    schedule_work(&sbi->s_sb_upd_work)
>>                                           update_super_work
>>                                           /* start a txn and add sb to it */
>>  sb->s_flags &= ~SB_ACTIVE;
>>  evict_inode
>>    ext4_evict_inode
>>     ext4_std_error
>>      ext4_handle_error
>>       /* direct commit of sb (Not good!) */
>> 

Ohk. So even after clearing SB_ACTIVE flag, we still have FS operations
running like ext4_evict_inode which can race with the super block update
work starting a txn. 

Thanks for catching that scenario.


>> 
>> Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
>> enough to handle this as well. For example, if we add a flag like
>> sbi->s_journal_destroying, then:
>> 
>> ext4_put_super
>>  sbi->s_journal_destroying = true
>>  flush_workqueue()
>>   /* sb is now journalled */
>>  jbd2_journal_destory
>>   jbd2_journal_commit_transaction
>>    /* add tag for sb in descriptor and add buffer to wbufs[] */
>>    /* Later from some other buffer in the txn: */
>>    jbd2_journal_next_log_block
>>     /* hits error in ext4_journal_bmap */
>>     ext4_handle_error
>>       sbi->s_journal_destroying == true
>>       /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
>>    jbd2_journal_abort
>>    /* after abort everything in wbufs[] is written to journal */
>> 
>> In the above we will have a checksum mismatch but then maybe its not really
>> an issue. Maybe since we never commit the txn it is understood that the contents
>> can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
>> and the actual super block contents? In which case the sbi flag approach should
>> be fine.
>> 
>> Does my understanding sound correct?
>
> Yes. Since the transaction does not get committed, its contents will be
> (and must be) ignored. So although you are correct that the superblock
> content in the transaction need to match the content we write directly, it
> does not matter because whatever is in the uncommitted transaction must
> never be written to the final position on disk.
>
> 								Honza

Thanks Jan for the suggestion. I now see what you meant by having an sbi
flag. Since that is local to ext4, we can set it just before flushing
the workqueue and we know that there won't be any update super block
work which can start a txn after flushing is complete. This flag can
then be used to check in any of the error handling paths to decide on
whether to schedule a new update work or not for updating sb.

This makes sense. Thanks for the review!

-ritesh

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-03-05  4:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-22  8:40 [PATCH 0/2] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
2025-02-22  8:40 ` [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE Ojaswin Mujoo
2025-02-24 14:52   ` Jan Kara
2025-02-27 11:20     ` Ojaswin Mujoo
2025-03-04 10:25       ` Jan Kara
2025-03-04 12:54         ` Ojaswin Mujoo
2025-03-05  4:14         ` Ritesh Harjani
2025-02-25  1:53   ` Baokun Li
2025-02-25 12:06     ` Ojaswin Mujoo
2025-02-26  1:55       ` Baokun Li
2025-02-27 11:43         ` Ojaswin Mujoo
2025-02-27 12:51           ` Baokun Li
2025-03-03 17:30             ` Ojaswin Mujoo
2025-02-22  8:40 ` [PATCH 2/2] ext4: Make sb update interval tunable Ojaswin Mujoo
2025-02-22  9:32   ` Ritesh Harjani (IBM)
2025-02-24 14:27   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).