[PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues
@ 2023-06-06 13:59 Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 1/6] jbd2: recheck chechpointing non-dirty buffer Zhang Yi
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

v2->v3:
 - Init released parameter in journal_shrink_one_cp_list() instead of
   __jbd2_journal_clean_checkpoint_list() in patch 3.
 - Fix a comment in patch 5.
v1->v2:
 - Drop the last patch in [1].
 - Merge journal_clean_one_cp_list() into journal_shrink_one_cp_list().
 - Fix the race issues through trying to hold buffer lock and check
   dirty state under the lock.
 - Append a cleanup patch, remove __journal_try_to_free_buffer().

Hello,

The first two patches are from [1] and are not changed, appending
another four (it depends on the first three) to fix another three race
issues in the checkpoint procedure which could also lead to inconsistent
results.

[1] https://lore.kernel.org/linux-ext4/20230516020226.2813588-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

Zhang Yi (5):
  jbd2: recheck chechpointing non-dirty buffer
  jbd2: remove t_checkpoint_io_list
  jbd2: remove journal_clean_one_cp_list()
  jbd2: fix a race when checking checkpoint buffer busy
  jbd2: remove __journal_try_to_free_buffer()

Zhihao Cheng (1):
  jbd2: Fix wrongly judgement for buffer head removing while doing
    checkpoint

 fs/jbd2/checkpoint.c        | 277 ++++++++++++------------------------
 fs/jbd2/commit.c            |   3 +-
 fs/jbd2/transaction.c       |  40 ++----
 include/linux/jbd2.h        |   7 +-
 include/trace/events/jbd2.h |  12 +-
 5 files changed, 108 insertions(+), 231 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 1/6] jbd2: recheck chechpointing non-dirty buffer
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 2/6] jbd2: remove t_checkpoint_io_list Zhang Yi
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

There is a long-standing metadata corruption issue that happens from
time to time, but it's very difficult to reproduce and analyse, benefit
from the JBD2_CYCLE_RECORD option, we found out that the problem is the
checkpointing process miss to write out some buffers which are raced by
another do_get_write_access(). Looks below for detail.

jbd2_log_do_checkpoint() //transaction X
 //buffer A is dirty and not belones to any transaction
 __buffer_relink_io() //move it to the IO list
 __flush_batch()
  write_dirty_buffer()
                             do_get_write_access()
                             clear_buffer_dirty
                             __jbd2_journal_file_buffer()
                             //add buffer A to a new transaction Y
   lock_buffer(bh)
   //doesn't write out
 __jbd2_journal_remove_checkpoint()
 //finish checkpoint except buffer A
 //filesystem corrupt if the new transaction Y isn't fully write out.

Due to the t_checkpoint_list walking loop in jbd2_log_do_checkpoint()
have already handles waiting for buffers under IO and re-added new
transaction to complete commit, and it also removing cleaned buffers,
this makes sure the list will eventually get empty. So it's fine to
leave buffers on the t_checkpoint_list while flushing out and completely
stop using the t_checkpoint_io_list.

Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 102 ++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 73 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 51bd38da21cd..25e3c20eb19f 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -57,28 +57,6 @@ static inline void __buffer_unlink(struct journal_head *jh)
 	}
 }
 
-/*
- * Move a buffer from the checkpoint list to the checkpoint io list
- *
- * Called with j_list_lock held
- */
-static inline void __buffer_relink_io(struct journal_head *jh)
-{
-	transaction_t *transaction = jh->b_cp_transaction;
-
-	__buffer_unlink_first(jh);
-
-	if (!transaction->t_checkpoint_io_list) {
-		jh->b_cpnext = jh->b_cpprev = jh;
-	} else {
-		jh->b_cpnext = transaction->t_checkpoint_io_list;
-		jh->b_cpprev = transaction->t_checkpoint_io_list->b_cpprev;
-		jh->b_cpprev->b_cpnext = jh;
-		jh->b_cpnext->b_cpprev = jh;
-	}
-	transaction->t_checkpoint_io_list = jh;
-}
-
 /*
  * Check a checkpoint buffer could be release or not.
  *
@@ -183,6 +161,7 @@ __flush_batch(journal_t *journal, int *batch_count)
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
 		BUFFER_TRACE(bh, "brelse");
 		__brelse(bh);
+		journal->j_chkpt_bhs[i] = NULL;
 	}
 	*batch_count = 0;
 }
@@ -242,6 +221,11 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 		jh = transaction->t_checkpoint_list;
 		bh = jh2bh(jh);
 
+		/*
+		 * The buffer may be writing back, or flushing out in the
+		 * last couple of cycles, or re-adding into a new transaction,
+		 * need to check it again until it's unlocked.
+		 */
 		if (buffer_locked(bh)) {
 			get_bh(bh);
 			spin_unlock(&journal->j_list_lock);
@@ -287,28 +271,32 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 		}
 		if (!buffer_dirty(bh)) {
 			BUFFER_TRACE(bh, "remove from checkpoint");
-			if (__jbd2_journal_remove_checkpoint(jh))
-				/* The transaction was released; we're done */
+			/*
+			 * If the transaction was released or the checkpoint
+			 * list was empty, we're done.
+			 */
+			if (__jbd2_journal_remove_checkpoint(jh) ||
+			    !transaction->t_checkpoint_list)
 				goto out;
-			continue;
+		} else {
+			/*
+			 * We are about to write the buffer, it could be
+			 * raced by some other transaction shrink or buffer
+			 * re-log logic once we release the j_list_lock,
+			 * leave it on the checkpoint list and check status
+			 * again to make sure it's clean.
+			 */
+			BUFFER_TRACE(bh, "queue");
+			get_bh(bh);
+			J_ASSERT_BH(bh, !buffer_jwrite(bh));
+			journal->j_chkpt_bhs[batch_count++] = bh;
+			transaction->t_chp_stats.cs_written++;
+			transaction->t_checkpoint_list = jh->b_cpnext;
 		}
-		/*
-		 * Important: we are about to write the buffer, and
-		 * possibly block, while still holding the journal
-		 * lock.  We cannot afford to let the transaction
-		 * logic start messing around with this buffer before
-		 * we write it to disk, as that would break
-		 * recoverability.
-		 */
-		BUFFER_TRACE(bh, "queue");
-		get_bh(bh);
-		J_ASSERT_BH(bh, !buffer_jwrite(bh));
-		journal->j_chkpt_bhs[batch_count++] = bh;
-		__buffer_relink_io(jh);
-		transaction->t_chp_stats.cs_written++;
+
 		if ((batch_count == JBD2_NR_BATCH) ||
-		    need_resched() ||
-		    spin_needbreak(&journal->j_list_lock))
+		    need_resched() || spin_needbreak(&journal->j_list_lock) ||
+		    jh2bh(transaction->t_checkpoint_list) == journal->j_chkpt_bhs[0])
 			goto unlock_and_flush;
 	}
 
@@ -322,38 +310,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 			goto restart;
 	}
 
-	/*
-	 * Now we issued all of the transaction's buffers, let's deal
-	 * with the buffers that are out for I/O.
-	 */
-restart2:
-	/* Did somebody clean up the transaction in the meanwhile? */
-	if (journal->j_checkpoint_transactions != transaction ||
-	    transaction->t_tid != this_tid)
-		goto out;
-
-	while (transaction->t_checkpoint_io_list) {
-		jh = transaction->t_checkpoint_io_list;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			/* the journal_head may have gone by now */
-			BUFFER_TRACE(bh, "brelse");
-			__brelse(bh);
-			spin_lock(&journal->j_list_lock);
-			goto restart2;
-		}
-
-		/*
-		 * Now in whatever state the buffer currently is, we
-		 * know that it has been written out and so we can
-		 * drop it from the list
-		 */
-		if (__jbd2_journal_remove_checkpoint(jh))
-			break;
-	}
 out:
 	spin_unlock(&journal->j_list_lock);
 	result = jbd2_cleanup_journal_tail(journal);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 2/6] jbd2: remove t_checkpoint_io_list
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 1/6] jbd2: recheck chechpointing non-dirty buffer Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list() Zhang Yi
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

Since t_checkpoint_io_list was stop using in jbd2_log_do_checkpoint()
now, it's time to remove the whole t_checkpoint_io_list logic.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 42 ++----------------------------------------
 fs/jbd2/commit.c     |  3 +--
 include/linux/jbd2.h |  6 ------
 3 files changed, 3 insertions(+), 48 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 25e3c20eb19f..55d6efdbea64 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -27,7 +27,7 @@
  *
  * Called with j_list_lock held.
  */
-static inline void __buffer_unlink_first(struct journal_head *jh)
+static inline void __buffer_unlink(struct journal_head *jh)
 {
 	transaction_t *transaction = jh->b_cp_transaction;
 
@@ -40,23 +40,6 @@ static inline void __buffer_unlink_first(struct journal_head *jh)
 	}
 }
 
-/*
- * Unlink a buffer from a transaction checkpoint(io) list.
- *
- * Called with j_list_lock held.
- */
-static inline void __buffer_unlink(struct journal_head *jh)
-{
-	transaction_t *transaction = jh->b_cp_transaction;
-
-	__buffer_unlink_first(jh);
-	if (transaction->t_checkpoint_io_list == jh) {
-		transaction->t_checkpoint_io_list = jh->b_cpnext;
-		if (transaction->t_checkpoint_io_list == jh)
-			transaction->t_checkpoint_io_list = NULL;
-	}
-}
-
 /*
  * Check a checkpoint buffer could be release or not.
  *
@@ -503,15 +486,6 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 			break;
 		if (need_resched() || spin_needbreak(&journal->j_list_lock))
 			break;
-		if (released)
-			continue;
-
-		nr_freed += journal_shrink_one_cp_list(transaction->t_checkpoint_io_list,
-						       nr_to_scan, &released);
-		if (*nr_to_scan == 0)
-			break;
-		if (need_resched() || spin_needbreak(&journal->j_list_lock))
-			break;
 	} while (transaction != last_transaction);
 
 	if (transaction != last_transaction) {
@@ -566,17 +540,6 @@ void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
 		 */
 		if (need_resched())
 			return;
-		if (ret)
-			continue;
-		/*
-		 * It is essential that we are as careful as in the case of
-		 * t_checkpoint_list with removing the buffer from the list as
-		 * we can possibly see not yet submitted buffers on io_list
-		 */
-		ret = journal_clean_one_cp_list(transaction->
-				t_checkpoint_io_list, destroy);
-		if (need_resched())
-			return;
 		/*
 		 * Stop scanning if we couldn't free the transaction. This
 		 * avoids pointless scanning of transactions which still
@@ -661,7 +624,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
 	jbd2_journal_put_journal_head(jh);
 
 	/* Is this transaction empty? */
-	if (transaction->t_checkpoint_list || transaction->t_checkpoint_io_list)
+	if (transaction->t_checkpoint_list)
 		return 0;
 
 	/*
@@ -753,7 +716,6 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
 	J_ASSERT(transaction->t_forget == NULL);
 	J_ASSERT(transaction->t_shadow_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_list == NULL);
-	J_ASSERT(transaction->t_checkpoint_io_list == NULL);
 	J_ASSERT(atomic_read(&transaction->t_updates) == 0);
 	J_ASSERT(journal->j_committing_transaction != transaction);
 	J_ASSERT(journal->j_running_transaction != transaction);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b33155dd7001..1073259902a6 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -1141,8 +1141,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	spin_lock(&journal->j_list_lock);
 	commit_transaction->t_state = T_FINISHED;
 	/* Check if the transaction can be dropped now that we are finished */
-	if (commit_transaction->t_checkpoint_list == NULL &&
-	    commit_transaction->t_checkpoint_io_list == NULL) {
+	if (commit_transaction->t_checkpoint_list == NULL) {
 		__jbd2_journal_drop_transaction(journal, commit_transaction);
 		jbd2_journal_free_transaction(commit_transaction);
 	}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f619bae1dcc5..91a2cf4bc575 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -622,12 +622,6 @@ struct transaction_s
 	 */
 	struct journal_head	*t_checkpoint_list;
 
-	/*
-	 * Doubly-linked circular list of all buffers submitted for IO while
-	 * checkpointing. [j_list_lock]
-	 */
-	struct journal_head	*t_checkpoint_io_list;
-
 	/*
 	 * Doubly-linked circular list of metadata buffers being
 	 * shadowed by log IO.  The IO buffers on the iobuf list and
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list()
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 1/6] jbd2: recheck chechpointing non-dirty buffer Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 2/6] jbd2: remove t_checkpoint_io_list Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-06-07  8:30   ` Jan Kara
  2023-06-06 13:59 ` [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint Zhang Yi
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost
the same, so merge them into journal_shrink_one_cp_list(), remove the
nr_to_scan parameter, always scan and try to free the whole checkpoint
list.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/jbd2/checkpoint.c        | 75 +++++++++----------------------------
 include/trace/events/jbd2.h | 12 ++----
 2 files changed, 21 insertions(+), 66 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 55d6efdbea64..b94f847960c2 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -347,50 +347,10 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
 
 /* Checkpoint list management */
 
-/*
- * journal_clean_one_cp_list
- *
- * Find all the written-back checkpoint buffers in the given list and
- * release them. If 'destroy' is set, clean all buffers unconditionally.
- *
- * Called with j_list_lock held.
- * Returns 1 if we freed the transaction, 0 otherwise.
- */
-static int journal_clean_one_cp_list(struct journal_head *jh, bool destroy)
-{
-	struct journal_head *last_jh;
-	struct journal_head *next_jh = jh;
-
-	if (!jh)
-		return 0;
-
-	last_jh = jh->b_cpprev;
-	do {
-		jh = next_jh;
-		next_jh = jh->b_cpnext;
-
-		if (!destroy && __cp_buffer_busy(jh))
-			return 0;
-
-		if (__jbd2_journal_remove_checkpoint(jh))
-			return 1;
-		/*
-		 * This function only frees up some memory
-		 * if possible so we dont have an obligation
-		 * to finish processing. Bail out if preemption
-		 * requested:
-		 */
-		if (need_resched())
-			return 0;
-	} while (jh != last_jh);
-
-	return 0;
-}
-
 /*
  * journal_shrink_one_cp_list
  *
- * Find 'nr_to_scan' written-back checkpoint buffers in the given list
+ * Find all the written-back checkpoint buffers in the given list
  * and try to release them. If the whole transaction is released, set
  * the 'released' parameter. Return the number of released checkpointed
  * buffers.
@@ -398,15 +358,15 @@ static int journal_clean_one_cp_list(struct journal_head *jh, bool destroy)
  * Called with j_list_lock held.
  */
 static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
-						unsigned long *nr_to_scan,
-						bool *released)
+						bool destroy, bool *released)
 {
 	struct journal_head *last_jh;
 	struct journal_head *next_jh = jh;
 	unsigned long nr_freed = 0;
 	int ret;
 
-	if (!jh || *nr_to_scan == 0)
+	*released = false;
+	if (!jh)
 		return 0;
 
 	last_jh = jh->b_cpprev;
@@ -414,8 +374,7 @@ static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 
-		(*nr_to_scan)--;
-		if (__cp_buffer_busy(jh))
+		if (!destroy && __cp_buffer_busy(jh))
 			continue;
 
 		nr_freed++;
@@ -427,7 +386,7 @@ static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
 
 		if (need_resched())
 			break;
-	} while (jh != last_jh && *nr_to_scan);
+	} while (jh != last_jh);
 
 	return nr_freed;
 }
@@ -445,11 +404,11 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 						  unsigned long *nr_to_scan)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	bool released;
+	bool __maybe_unused released;
 	tid_t first_tid = 0, last_tid = 0, next_tid = 0;
 	tid_t tid = 0;
 	unsigned long nr_freed = 0;
-	unsigned long nr_scanned = *nr_to_scan;
+	unsigned long freed;
 
 again:
 	spin_lock(&journal->j_list_lock);
@@ -478,10 +437,11 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
 		tid = transaction->t_tid;
-		released = false;
 
-		nr_freed += journal_shrink_one_cp_list(transaction->t_checkpoint_list,
-						       nr_to_scan, &released);
+		freed = journal_shrink_one_cp_list(transaction->t_checkpoint_list,
+						   false, &released);
+		nr_freed += freed;
+		(*nr_to_scan) -= min(*nr_to_scan, freed);
 		if (*nr_to_scan == 0)
 			break;
 		if (need_resched() || spin_needbreak(&journal->j_list_lock))
@@ -502,9 +462,8 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 	if (*nr_to_scan && next_tid)
 		goto again;
 out:
-	nr_scanned -= *nr_to_scan;
 	trace_jbd2_shrink_checkpoint_list(journal, first_tid, tid, last_tid,
-					  nr_freed, nr_scanned, next_tid);
+					  nr_freed, next_tid);
 
 	return nr_freed;
 }
@@ -520,7 +479,7 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	int ret;
+	bool released;
 
 	transaction = journal->j_checkpoint_transactions;
 	if (!transaction)
@@ -531,8 +490,8 @@ void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list,
-						destroy);
+		journal_shrink_one_cp_list(transaction->t_checkpoint_list,
+					   destroy, &released);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
@@ -545,7 +504,7 @@ void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
 		 * avoids pointless scanning of transactions which still
 		 * weren't checkpointed.
 		 */
-		if (!ret)
+		if (!released)
 			return;
 	} while (transaction != last_transaction);
 }
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 8f5ee380d309..5646ae15a957 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -462,11 +462,9 @@ TRACE_EVENT(jbd2_shrink_scan_exit,
 TRACE_EVENT(jbd2_shrink_checkpoint_list,
 
 	TP_PROTO(journal_t *journal, tid_t first_tid, tid_t tid, tid_t last_tid,
-		 unsigned long nr_freed, unsigned long nr_scanned,
-		 tid_t next_tid),
+		 unsigned long nr_freed, tid_t next_tid),
 
-	TP_ARGS(journal, first_tid, tid, last_tid, nr_freed,
-		nr_scanned, next_tid),
+	TP_ARGS(journal, first_tid, tid, last_tid, nr_freed, next_tid),
 
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
@@ -474,7 +472,6 @@ TRACE_EVENT(jbd2_shrink_checkpoint_list,
 		__field(tid_t, tid)
 		__field(tid_t, last_tid)
 		__field(unsigned long, nr_freed)
-		__field(unsigned long, nr_scanned)
 		__field(tid_t, next_tid)
 	),
 
@@ -484,15 +481,14 @@ TRACE_EVENT(jbd2_shrink_checkpoint_list,
 		__entry->tid		= tid;
 		__entry->last_tid	= last_tid;
 		__entry->nr_freed	= nr_freed;
-		__entry->nr_scanned	= nr_scanned;
 		__entry->next_tid	= next_tid;
 	),
 
 	TP_printk("dev %d,%d shrink transaction %u-%u(%u) freed %lu "
-		  "scanned %lu next transaction %u",
+		  "next transaction %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->first_tid, __entry->tid, __entry->last_tid,
-		  __entry->nr_freed, __entry->nr_scanned, __entry->next_tid)
+		  __entry->nr_freed, __entry->next_tid)
 );
 
 #endif /* _TRACE_JBD2_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
                   ` (2 preceding siblings ...)
  2023-06-06 13:59 ` [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list() Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-06-13  4:31   ` Theodore Ts'o
  2023-06-06 13:59 ` [PATCH v3 5/6] jbd2: fix a race when checking checkpoint buffer busy Zhang Yi
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhihao Cheng <chengzhihao1@huawei.com>

Following process,

jbd2_journal_commit_transaction
// there are several dirty buffer heads in transaction->t_checkpoint_list
          P1                   wb_workfn
jbd2_log_do_checkpoint
 if (buffer_locked(bh)) // false
                            __block_write_full_page
                             trylock_buffer(bh)
                             test_clear_buffer_dirty(bh)
 if (!buffer_dirty(bh))
  __jbd2_journal_remove_checkpoint(jh)
   if (buffer_write_io_error(bh)) // false
                             >> bh IO error occurs <<
 jbd2_cleanup_journal_tail
  __jbd2_update_log_tail
   jbd2_write_superblock
   // The bh won't be replayed in next mount.
, which could corrupt the ext4 image, fetch a reproducer in [Link].

Since writeback process clears buffer dirty after locking buffer head,
we can fix it by try locking buffer and check dirtiness while buffer is
locked, the buffer head can be removed if it is neither dirty nor locked.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index b94f847960c2..42b34cab64fb 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -204,20 +204,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 		jh = transaction->t_checkpoint_list;
 		bh = jh2bh(jh);
 
-		/*
-		 * The buffer may be writing back, or flushing out in the
-		 * last couple of cycles, or re-adding into a new transaction,
-		 * need to check it again until it's unlocked.
-		 */
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			/* the journal_head may have gone by now */
-			BUFFER_TRACE(bh, "brelse");
-			__brelse(bh);
-			goto retry;
-		}
 		if (jh->b_transaction != NULL) {
 			transaction_t *t = jh->b_transaction;
 			tid_t tid = t->t_tid;
@@ -252,7 +238,22 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
-		if (!buffer_dirty(bh)) {
+		if (!trylock_buffer(bh)) {
+			/*
+			 * The buffer is locked, it may be writing back, or
+			 * flushing out in the last couple of cycles, or
+			 * re-adding into a new transaction, need to check
+			 * it again until it's unlocked.
+			 */
+			get_bh(bh);
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			/* the journal_head may have gone by now */
+			BUFFER_TRACE(bh, "brelse");
+			__brelse(bh);
+			goto retry;
+		} else if (!buffer_dirty(bh)) {
+			unlock_buffer(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
 			/*
 			 * If the transaction was released or the checkpoint
@@ -262,6 +263,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 			    !transaction->t_checkpoint_list)
 				goto out;
 		} else {
+			unlock_buffer(bh);
 			/*
 			 * We are about to write the buffer, it could be
 			 * raced by some other transaction shrink or buffer
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 5/6] jbd2: fix a race when checking checkpoint buffer busy
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
                   ` (3 preceding siblings ...)
  2023-06-06 13:59 ` [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-06-06 13:59 ` [PATCH v3 6/6] jbd2: remove __journal_try_to_free_buffer() Zhang Yi
  2023-07-12 18:29 ` [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Theodore Ts'o
  6 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

Before removing checkpoint buffer from the t_checkpoint_list, we have to
check both BH_Dirty and BH_Lock bits together to distinguish buffers
have not been or were being written back. But __cp_buffer_busy() checks
them separately, it first check lock state and then check dirty, the
window between these two checks could be raced by writing back
procedure, which locks buffer and clears buffer dirty before I/O
completes. So it cannot guarantee checkpointing buffers been written
back to disk if some error happens later. Finally, it may clean
checkpoint transactions and lead to inconsistent filesystem.

jbd2_journal_forget() and __journal_try_to_free_buffer() also have the
same problem (journal_unmap_buffer() escape from this issue since it's
running under the buffer lock), so fix them through introducing a new
helper to try holding the buffer lock and remove really clean buffer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  | 38 +++++++++++++++++++++++++++++++++++---
 fs/jbd2/transaction.c | 17 +++++------------
 include/linux/jbd2.h  |  1 +
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 42b34cab64fb..9ec91017a7f3 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -376,11 +376,15 @@ static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 
-		if (!destroy && __cp_buffer_busy(jh))
-			continue;
+		if (destroy) {
+			ret = __jbd2_journal_remove_checkpoint(jh);
+		} else {
+			ret = jbd2_journal_try_remove_checkpoint(jh);
+			if (ret < 0)
+				continue;
+		}
 
 		nr_freed++;
-		ret = __jbd2_journal_remove_checkpoint(jh);
 		if (ret) {
 			*released = true;
 			break;
@@ -616,6 +620,34 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
 	return 1;
 }
 
+/*
+ * Check the checkpoint buffer and try to remove it from the checkpoint
+ * list if it's clean. Returns -EBUSY if it is not clean, returns 1 if
+ * it frees the transaction, 0 otherwise.
+ *
+ * This function is called with j_list_lock held.
+ */
+int jbd2_journal_try_remove_checkpoint(struct journal_head *jh)
+{
+	struct buffer_head *bh = jh2bh(jh);
+
+	if (!trylock_buffer(bh))
+		return -EBUSY;
+	if (buffer_dirty(bh)) {
+		unlock_buffer(bh);
+		return -EBUSY;
+	}
+	unlock_buffer(bh);
+
+	/*
+	 * Buffer is clean and the IO has finished (we held the buffer
+	 * lock) so the checkpoint is done. We can safely remove the
+	 * buffer from this transaction.
+	 */
+	JBUFFER_TRACE(jh, "remove from checkpoint list");
+	return __jbd2_journal_remove_checkpoint(jh);
+}
+
 /*
  * journal_insert_checkpoint: put a committed buffer onto a checkpoint
  * list so that we know when it is safe to clean the transaction out of
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 18611241f451..6ef5022949c4 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1784,8 +1784,7 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
 		 * Otherwise, if the buffer has been written to disk,
 		 * it is safe to remove the checkpoint and drop it.
 		 */
-		if (!buffer_dirty(bh)) {
-			__jbd2_journal_remove_checkpoint(jh);
+		if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
 			spin_unlock(&journal->j_list_lock);
 			goto drop;
 		}
@@ -2112,20 +2111,14 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
 
 	jh = bh2jh(bh);
 
-	if (buffer_locked(bh) || buffer_dirty(bh))
-		goto out;
-
 	if (jh->b_next_transaction != NULL || jh->b_transaction != NULL)
-		goto out;
+		return;
 
 	spin_lock(&journal->j_list_lock);
-	if (jh->b_cp_transaction != NULL) {
-		/* written-back checkpointed metadata buffer */
-		JBUFFER_TRACE(jh, "remove from checkpoint list");
-		__jbd2_journal_remove_checkpoint(jh);
-	}
+	/* Remove written-back checkpointed metadata buffer */
+	if (jh->b_cp_transaction != NULL)
+		jbd2_journal_try_remove_checkpoint(jh);
 	spin_unlock(&journal->j_list_lock);
-out:
 	return;
 }
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 91a2cf4bc575..c212da35a052 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1443,6 +1443,7 @@ extern void jbd2_journal_commit_transaction(journal_t *);
 void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy);
 unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal, unsigned long *nr_to_scan);
 int __jbd2_journal_remove_checkpoint(struct journal_head *);
+int jbd2_journal_try_remove_checkpoint(struct journal_head *jh);
 void jbd2_journal_destroy_checkpoint(journal_t *journal);
 void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 6/6] jbd2: remove __journal_try_to_free_buffer()
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
                   ` (4 preceding siblings ...)
  2023-06-06 13:59 ` [PATCH v3 5/6] jbd2: fix a race when checking checkpoint buffer busy Zhang Yi
@ 2023-06-06 13:59 ` Zhang Yi
  2023-07-12 18:29 ` [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Theodore Ts'o
  6 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-06 13:59 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3,
	chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

__journal_try_to_free_buffer() has only one caller and it's logic is
much simple now, so just remove it and open code in
jbd2_journal_try_to_free_buffers().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 31 +++++++------------------------
 1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6ef5022949c4..4d1fda1f7143 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2099,29 +2099,6 @@ void jbd2_journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
 	__brelse(bh);
 }
 
-/*
- * Called from jbd2_journal_try_to_free_buffers().
- *
- * Called under jh->b_state_lock
- */
-static void
-__journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
-{
-	struct journal_head *jh;
-
-	jh = bh2jh(bh);
-
-	if (jh->b_next_transaction != NULL || jh->b_transaction != NULL)
-		return;
-
-	spin_lock(&journal->j_list_lock);
-	/* Remove written-back checkpointed metadata buffer */
-	if (jh->b_cp_transaction != NULL)
-		jbd2_journal_try_remove_checkpoint(jh);
-	spin_unlock(&journal->j_list_lock);
-	return;
-}
-
 /**
  * jbd2_journal_try_to_free_buffers() - try to free page buffers.
  * @journal: journal for operation
@@ -2179,7 +2156,13 @@ bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
 			continue;
 
 		spin_lock(&jh->b_state_lock);
-		__journal_try_to_free_buffer(journal, bh);
+		if (!jh->b_transaction && !jh->b_next_transaction) {
+			spin_lock(&journal->j_list_lock);
+			/* Remove written-back checkpointed metadata buffer */
+			if (jh->b_cp_transaction != NULL)
+				jbd2_journal_try_remove_checkpoint(jh);
+			spin_unlock(&journal->j_list_lock);
+		}
 		spin_unlock(&jh->b_state_lock);
 		jbd2_journal_put_journal_head(jh);
 		if (buffer_jbd(bh))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list()
  2023-06-06 13:59 ` [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list() Zhang Yi
@ 2023-06-07  8:30   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2023-06-07  8:30 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, yukuai3,
	chengzhihao1

On Tue 06-06-23 21:59:25, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost
> the same, so merge them into journal_shrink_one_cp_list(), remove the
> nr_to_scan parameter, always scan and try to free the whole checkpoint
> list.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/checkpoint.c        | 75 +++++++++----------------------------
>  include/trace/events/jbd2.h | 12 ++----
>  2 files changed, 21 insertions(+), 66 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 55d6efdbea64..b94f847960c2 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -347,50 +347,10 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
>  
>  /* Checkpoint list management */
>  
> -/*
> - * journal_clean_one_cp_list
> - *
> - * Find all the written-back checkpoint buffers in the given list and
> - * release them. If 'destroy' is set, clean all buffers unconditionally.
> - *
> - * Called with j_list_lock held.
> - * Returns 1 if we freed the transaction, 0 otherwise.
> - */
> -static int journal_clean_one_cp_list(struct journal_head *jh, bool destroy)
> -{
> -	struct journal_head *last_jh;
> -	struct journal_head *next_jh = jh;
> -
> -	if (!jh)
> -		return 0;
> -
> -	last_jh = jh->b_cpprev;
> -	do {
> -		jh = next_jh;
> -		next_jh = jh->b_cpnext;
> -
> -		if (!destroy && __cp_buffer_busy(jh))
> -			return 0;
> -
> -		if (__jbd2_journal_remove_checkpoint(jh))
> -			return 1;
> -		/*
> -		 * This function only frees up some memory
> -		 * if possible so we dont have an obligation
> -		 * to finish processing. Bail out if preemption
> -		 * requested:
> -		 */
> -		if (need_resched())
> -			return 0;
> -	} while (jh != last_jh);
> -
> -	return 0;
> -}
> -
>  /*
>   * journal_shrink_one_cp_list
>   *
> - * Find 'nr_to_scan' written-back checkpoint buffers in the given list
> + * Find all the written-back checkpoint buffers in the given list
>   * and try to release them. If the whole transaction is released, set
>   * the 'released' parameter. Return the number of released checkpointed
>   * buffers.
> @@ -398,15 +358,15 @@ static int journal_clean_one_cp_list(struct journal_head *jh, bool destroy)
>   * Called with j_list_lock held.
>   */
>  static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
> -						unsigned long *nr_to_scan,
> -						bool *released)
> +						bool destroy, bool *released)
>  {
>  	struct journal_head *last_jh;
>  	struct journal_head *next_jh = jh;
>  	unsigned long nr_freed = 0;
>  	int ret;
>  
> -	if (!jh || *nr_to_scan == 0)
> +	*released = false;
> +	if (!jh)
>  		return 0;
>  
>  	last_jh = jh->b_cpprev;
> @@ -414,8 +374,7 @@ static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
>  		jh = next_jh;
>  		next_jh = jh->b_cpnext;
>  
> -		(*nr_to_scan)--;
> -		if (__cp_buffer_busy(jh))
> +		if (!destroy && __cp_buffer_busy(jh))
>  			continue;
>  
>  		nr_freed++;
> @@ -427,7 +386,7 @@ static unsigned long journal_shrink_one_cp_list(struct journal_head *jh,
>  
>  		if (need_resched())
>  			break;
> -	} while (jh != last_jh && *nr_to_scan);
> +	} while (jh != last_jh);
>  
>  	return nr_freed;
>  }
> @@ -445,11 +404,11 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
>  						  unsigned long *nr_to_scan)
>  {
>  	transaction_t *transaction, *last_transaction, *next_transaction;
> -	bool released;
> +	bool __maybe_unused released;
>  	tid_t first_tid = 0, last_tid = 0, next_tid = 0;
>  	tid_t tid = 0;
>  	unsigned long nr_freed = 0;
> -	unsigned long nr_scanned = *nr_to_scan;
> +	unsigned long freed;
>  
>  again:
>  	spin_lock(&journal->j_list_lock);
> @@ -478,10 +437,11 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
>  		transaction = next_transaction;
>  		next_transaction = transaction->t_cpnext;
>  		tid = transaction->t_tid;
> -		released = false;
>  
> -		nr_freed += journal_shrink_one_cp_list(transaction->t_checkpoint_list,
> -						       nr_to_scan, &released);
> +		freed = journal_shrink_one_cp_list(transaction->t_checkpoint_list,
> +						   false, &released);
> +		nr_freed += freed;
> +		(*nr_to_scan) -= min(*nr_to_scan, freed);
>  		if (*nr_to_scan == 0)
>  			break;
>  		if (need_resched() || spin_needbreak(&journal->j_list_lock))
> @@ -502,9 +462,8 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
>  	if (*nr_to_scan && next_tid)
>  		goto again;
>  out:
> -	nr_scanned -= *nr_to_scan;
>  	trace_jbd2_shrink_checkpoint_list(journal, first_tid, tid, last_tid,
> -					  nr_freed, nr_scanned, next_tid);
> +					  nr_freed, next_tid);
>  
>  	return nr_freed;
>  }
> @@ -520,7 +479,7 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
>  void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
>  {
>  	transaction_t *transaction, *last_transaction, *next_transaction;
> -	int ret;
> +	bool released;
>  
>  	transaction = journal->j_checkpoint_transactions;
>  	if (!transaction)
> @@ -531,8 +490,8 @@ void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
>  	do {
>  		transaction = next_transaction;
>  		next_transaction = transaction->t_cpnext;
> -		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list,
> -						destroy);
> +		journal_shrink_one_cp_list(transaction->t_checkpoint_list,
> +					   destroy, &released);
>  		/*
>  		 * This function only frees up some memory if possible so we
>  		 * dont have an obligation to finish processing. Bail out if
> @@ -545,7 +504,7 @@ void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy)
>  		 * avoids pointless scanning of transactions which still
>  		 * weren't checkpointed.
>  		 */
> -		if (!ret)
> +		if (!released)
>  			return;
>  	} while (transaction != last_transaction);
>  }
> diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
> index 8f5ee380d309..5646ae15a957 100644
> --- a/include/trace/events/jbd2.h
> +++ b/include/trace/events/jbd2.h
> @@ -462,11 +462,9 @@ TRACE_EVENT(jbd2_shrink_scan_exit,
>  TRACE_EVENT(jbd2_shrink_checkpoint_list,
>  
>  	TP_PROTO(journal_t *journal, tid_t first_tid, tid_t tid, tid_t last_tid,
> -		 unsigned long nr_freed, unsigned long nr_scanned,
> -		 tid_t next_tid),
> +		 unsigned long nr_freed, tid_t next_tid),
>  
> -	TP_ARGS(journal, first_tid, tid, last_tid, nr_freed,
> -		nr_scanned, next_tid),
> +	TP_ARGS(journal, first_tid, tid, last_tid, nr_freed, next_tid),
>  
>  	TP_STRUCT__entry(
>  		__field(dev_t, dev)
> @@ -474,7 +472,6 @@ TRACE_EVENT(jbd2_shrink_checkpoint_list,
>  		__field(tid_t, tid)
>  		__field(tid_t, last_tid)
>  		__field(unsigned long, nr_freed)
> -		__field(unsigned long, nr_scanned)
>  		__field(tid_t, next_tid)
>  	),
>  
> @@ -484,15 +481,14 @@ TRACE_EVENT(jbd2_shrink_checkpoint_list,
>  		__entry->tid		= tid;
>  		__entry->last_tid	= last_tid;
>  		__entry->nr_freed	= nr_freed;
> -		__entry->nr_scanned	= nr_scanned;
>  		__entry->next_tid	= next_tid;
>  	),
>  
>  	TP_printk("dev %d,%d shrink transaction %u-%u(%u) freed %lu "
> -		  "scanned %lu next transaction %u",
> +		  "next transaction %u",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->first_tid, __entry->tid, __entry->last_tid,
> -		  __entry->nr_freed, __entry->nr_scanned, __entry->next_tid)
> +		  __entry->nr_freed, __entry->next_tid)
>  );
>  
>  #endif /* _TRACE_JBD2_H */
> -- 
> 2.31.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-06 13:59 ` [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint Zhang Yi
@ 2023-06-13  4:31   ` Theodore Ts'o
  2023-06-13  8:13     ` Zhihao Cheng
  0 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2023-06-13  4:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3, chengzhihao1

There is something about this patch which is causing test runs to hang
when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
60-70% of the time.

When I took a closer look, the problem seems to be e2fsck is hanging
after a SEGV when running e2fsck -nf on the block device.  This then
causes the check script to hang, until the test appliance's safety
timer triggers and forces a shutdown of the test VM and aborts the
test run.

The cause of the hang is clearly an e2fsprogs bug --- no matter how
corrupted the file system is, e2fsck should never crash or hang.  So
something is clearly going wrong with e2fsck:

    ...
    Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
    Clear? no

    Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
    Fix? no

    corrupted size vs. prev_size
    Signal (6) SIGABRT si_code=SI_TKILL 

    (Note: "corrutped size vs prev_size" is issued by glibc when
    malloc's internal data structures have been corrupted.  So
    there is definitely something going very wrong with e2fsck.)

That being said, if I run the same test on the parent commit (patch
3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
or crash, and the regression tests complete.  So this patch is
changing the behavior of the kernel in terms of the file system that
is left behind after a large number of injected I/O errors.

My plan therefore is to drop patches 4/6 through 6/6 of this patch
series.  This will allow at least the "long standing metadata
corruption issue that happens from to time" to be addressed, and it
will give us time study what's going on here in more detail.  I've
captured the compressed file system image which is causing e2fsck
(version 1.47.0) to corrupt malloc's data structure, and I'll try see
what using Address Sanitizer or valgrind show about what's going on.

Looking at the patch, it looks pretty innocuous, and I don't
understand how this could be making a significant enough difference
that it's causing e2fsck, which had previously been working fine, to
now start tossing its cookies.  If you could double check the patch
and see you see anything that I might have missed in my code review,
I'd really appreciate it.

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-13  4:31   ` Theodore Ts'o
@ 2023-06-13  8:13     ` Zhihao Cheng
  2023-06-13 17:27       ` Theodore Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: Zhihao Cheng @ 2023-06-13  8:13 UTC (permalink / raw)
  To: Theodore Ts'o, Zhang Yi
  Cc: linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

在 2023/6/13 12:31, Theodore Ts'o 写道:
> There is something about this patch which is causing test runs to hang
> when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
> 60-70% of the time.
> 
> When I took a closer look, the problem seems to be e2fsck is hanging
> after a SEGV when running e2fsck -nf on the block device.  This then
> causes the check script to hang, until the test appliance's safety
> timer triggers and forces a shutdown of the test VM and aborts the
> test run.
> 
> The cause of the hang is clearly an e2fsprogs bug --- no matter how
> corrupted the file system is, e2fsck should never crash or hang.  So
> something is clearly going wrong with e2fsck:
> 
>      ...
>      Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
>      Clear? no
> 
>      Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
>      Fix? no
> 
>      corrupted size vs. prev_size
>      Signal (6) SIGABRT si_code=SI_TKILL
> 
>      (Note: "corrutped size vs prev_size" is issued by glibc when
>      malloc's internal data structures have been corrupted.  So
>      there is definitely something going very wrong with e2fsck.)
>      
> That being said, if I run the same test on the parent commit (patch
> 3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
> or crash, and the regression tests complete.  So this patch is
> changing the behavior of the kernel in terms of the file system that
> is left behind after a large number of injected I/O errors.
> 
> My plan therefore is to drop patches 4/6 through 6/6 of this patch
> series.  This will allow at least the "long standing metadata
> corruption issue that happens from to time" to be addressed, and it
> will give us time study what's going on here in more detail.  I've
> captured the compressed file system image which is causing e2fsck
> (version 1.47.0) to corrupt malloc's data structure, and I'll try see
> what using Address Sanitizer or valgrind show about what's going on.
> 

Hi Ted, I tried to run './check generic/475' many rounds(1.47.0, 
5-Feb-2023), and I cannot reproduce the problem with this patch. Could 
you send me a compressed image which can trigger the problem with 'fsck 
-fn'?

I agree to make clear the problem first before applying this patch.

> Looking at the patch, it looks pretty innocuous, and I don't
> understand how this could be making a significant enough difference
> that it's causing e2fsck, which had previously been working fine, to
> now start tossing its cookies.  If you could double check the patch
> and see you see anything that I might have missed in my code review,
> I'd really appreciate it.
> 
> Thanks,
> 
> 					- Ted
> 
> .
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-13  8:13     ` Zhihao Cheng
@ 2023-06-13 17:27       ` Theodore Ts'o
  2023-06-14  5:42         ` Theodore Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2023-06-13 17:27 UTC (permalink / raw)
  To: Zhihao Cheng
  Cc: Zhang Yi, linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

On Tue, Jun 13, 2023 at 04:13:06PM +0800, Zhihao Cheng wrote:
> 
> Hi Ted, I tried to run './check generic/475' many rounds(1.47.0,
> 5-Feb-2023), and I cannot reproduce the problem with this patch.

What file system configuration (e.g., mke2fs options) were you using
when you ran generic/475?  I reproduced the problem with
CONFIG_LOCKDEP enabled, with the ext4/adv configuration[1], which
means that the file system was created using "mke2fs -t ext4 -O
inline_data,fast_commit".  The size of the test file system was 5 GiB.

[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/adv

At this point, it looks like the problem is timing specific.  When I
built at patch 3/6 of your patch series, I was no longer able to
trigger the failure using the CONFIG_LOCKDEP kernel --- specifically
using a kernel config generated using install-kconfig[2] with the
--lockdep command-line-option.

[2] https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig

However, when I built a kernel config without --lockdep, I was able to
trigger the problem for both the the ext4/adv and the ext4/ext3[1]
file system test scenario.  That is, doing a full regression test
suite using "gce-xfstests ltm -c ext4/all -g auto", the VM's for the
ext4/adv and ext4/ext3 VM's both hung while running generic/475.  And
using a non-lockdep kernel, the commands "gce-xfstests -c
ext4/adv,ext4/ext3 generic/475" would hang.  I ran this command twice,
to make sure there were no timing-related false negatives, and once we
hung while running generic/475 for ext4/adv, and once we hung while
running ext4/ext3:

% gce-xfstests ls -l
  ...
xfstests-tytso-20230613115748 34.172.36.63 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:07 ext4/adv generic/475 - RUNNING
xfstests-tytso-20230613115802 34.133.66.61 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:06 ext4/ext3 generic/475 - RUNNING

[3] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/ext3

Furthermore, when I rewond the git repo to just before this patch
series (which is currently at the end of the dev branch), the full
regression test suites ("-c ext4/all -g all") and the more specific
test run ("-C 5 -c ext4/adv,ext4/ext3 generic/475") did not hang.  I
am currently doing another bisect run using a non-lockdep kernel to
see if I can more detail.

> Could you send me a compressed image which can trigger the problem
> with 'fsck -fn'?

Sure.  I'll have to send that under a separate e-mail message, since
it's 15 megabytes.  It was created using "dd if=/dev/mapper/xt-vdc |
gzip -9 > broken-image-which-causes-e2fsck-to-segv.gz".
Unfortunately, I was not able to create a metadata-only dump because
the filesystem was too corrupted.  An attempt to run "e2image -Q
/dev/mapper/xt-vdc broken.qcow2" failed with:

e2image 1.47.0 (5-Feb-2023)
e2image: Corrupt extent header while iterating over inode 6016

I was able to run e2fsck compiled with clang's asan enabled, and
here's the ASAN report (this is against the master branch in
e2fsprogs's git repo, so it's a bit ahead of 1.47.0):

e2fsck 1.47.0 (5-Feb-2023)
=================================================================
==25033==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000009900 at pc 0x564bbf8ae405 bp 0x7ffdd82bf0e0 sp 0x7ffdd82be8b0
WRITE of size 4096 at 0x625000009900 thread T0
    #0 0x564bbf8ae404 in pread64 (/build/e2fsprogs-asan/e2fsck/e2fsck+0x24a404) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)
    #1 0x564bbfd47532 in raw_read_blk /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:240:12
    #2 0x564bbfd3965b in unix_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:1079:17
    #3 0x564bbfc6cecd in io_channel_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/io_manager.c:78:10
    #4 0x564bbf9a3791 in e2fsck_pass1_check_symlink /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass1.c:241:7
    #5 0x564bbfa28a9f in e2fsck_process_bad_inode /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1990:8
    #6 0x564bbfa21e9c in check_dir_block /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1525:8
    #7 0x564bbfa18a3b in check_dir_block2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1034:8
    #8 0x564bbfb9ee4a in ext2fs_dblist_iterate3 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:216:9
    #9 0x564bbfb9ef79 in ext2fs_dblist_iterate2 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:229:9
    #10 0x564bbfa14529 in e2fsck_pass2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:190:20
    #11 0x564bbf980660 in e2fsck_run /usr/projects/e2fsprogs/e2fsprogs/e2fsck/e2fsck.c:262:3
    #12 0x564bbf95c774 in main /usr/projects/e2fsprogs/e2fsprogs/e2fsck/unix.c:1931:15
    #13 0x7f65d4e4e189 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7f65d4e4e244 in __libc_start_main csu/../csu/libc-start.c:381:3
    #15 0x564bbf892840 in _start (/build/e2fsprogs-asan/e2fsck/e2fsck+0x22e840) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)

I'm still digging into finding the root cause; I'll let you know if I
find more.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-13 17:27       ` Theodore Ts'o
@ 2023-06-14  5:42         ` Theodore Ts'o
  2023-06-14 13:25           ` Zhang Yi
  0 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2023-06-14  5:42 UTC (permalink / raw)
  To: Zhihao Cheng
  Cc: Zhang Yi, linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

OK, some more updates.  First of all, the e2fsck hang in the ext4/adv
case is an inline_data bug in e2fsck/pass2.c:check_dir_block(); the
code is clearly buggy, and I'll be sending out a fix in the next day
or two.

I still don't understand why this patch series is changing the kernel
behaviour enough to change the resulting file system in such a way as
to unmask this bug.  The bug is triggered by file system corruption,
so the question is whether this patch series is somehow causing the
file system to be more corrupted than it otherwise would be.  I'm not
sure.

However, the ext4/ext3 hang *is* a real hang in the kernel space, and
generic/475 is not completing because the kernel seems to have ended
up deadlocking somehow.  With just the first patch in this patch
series ("jbd2: recheck chechpointing non-dirty buffer") we're getting
a kernel NULL pointer derefence:

[  310.447568] EXT4-fs error (device dm-7): ext4_check_bdev_write_error:223: comm fsstress: Error while async write back metadata
[  310.458038] EXT4-fs error (device dm-7): __ext4_get_inode_loc_noinmem:4467: inode #99400: block 393286: comm fsstress: unable to read itable block
[  310.458421] JBD2: IO error reading journal superblock
[  310.484755] EXT4-fs warning (device dm-7): ext4_end_bio:343: I/O error 10 writing to inode 36066 starting block 19083)
[  310.490956] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  310.490959] #PF: supervisor write access in kernel mode
[  310.490961] #PF: error_code(0x0002) - not-present page
[  310.490963] PGD 0 P4D 0 
[  310.490966] Oops: 0002 [#1] PREEMPT SMP PTI
[  310.490970] CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
[  310.490974] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
[  310.490976] RIP: 0010:jbd2_journal_set_features+0x13d/0x430
[  310.490985] Code: 0f 94 c0 44 20 e8 0f 85 e0 00 00 00 be 94 01 00 00 48 c7 c7 a1 33 59 b4 48 89 0c 24 4c 8b 7d 38 e8 a8 dc c5 ff 2e 2e 2e 31 c0 <f0> 49 0f ba 2f 02 48 8b 0c 24 0f 82 24 02 00 00 45 84 ed 8b 41 28
[  310.490988] RSP: 0018:ffffb9b441043b30 EFLAGS: 00010246
[  310.490990] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8edb447b8100
[  310.490993] RDX: 0000000000000000 RSI: 0000000000000194 RDI: ffffffffb45933a1
[  310.490994] RBP: ffff8edb45a62800 R08: ffffffffb460d6c0 R09: 0000000000000000
[  310.490996] R10: 204f49203a324442 R11: 4f49203a3244424a R12: 0000000000000000
[  310.490997] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[  310.490999] FS:  00007f2940cca740(0000) GS:ffff8edc19500000(0000) knlGS:0000000000000000
[  310.491005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  310.491007] CR2: 0000000000000000 CR3: 000000012543e003 CR4: 00000000003706e0
[  310.491009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  310.491011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  310.491012] Call Trace:
[  310.491016]  <TASK>
[  310.491019]  ? __die+0x23/0x60
[  310.491025]  ? page_fault_oops+0xa4/0x170
[  310.491029]  ? exc_page_fault+0x67/0x170
[  310.491032]  ? asm_exc_page_fault+0x26/0x30
[  310.491039]  ? jbd2_journal_set_features+0x13d/0x430
[  310.491043]  jbd2_journal_revoke+0x47/0x1e0
[  310.491046]  __ext4_forget+0xc3/0x1b0
[  310.491051]  ext4_free_blocks+0x214/0x2f0
[  310.491056]  ext4_free_branches+0xeb/0x270
[  310.491061]  ext4_ind_truncate+0x2bf/0x320
[  310.491065]  ext4_truncate+0x1e4/0x490
[  310.491069]  ext4_handle_inode_extension+0x1bd/0x2a0
[  310.491073]  ? iomap_dio_complete+0xaf/0x1d0
[  310.511141] ------------[ cut here ]------------
[  310.516121]  ext4_dio_write_iter+0x346/0x3e0
[  310.516132]  ? __handle_mm_fault+0x171/0x200
[  310.516135]  vfs_write+0x21a/0x3e0
[  310.516140]  ksys_write+0x6f/0xf0
[  310.516142]  do_syscall_64+0x3b/0x90
[  310.516147]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  310.516154] RIP: 0033:0x7f2940eb2fb3
[  310.516158] Code: 75 05 48 83 c4 58 c3 e8 cb 41 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[  310.516161] RSP: 002b:00007ffe9a322cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  310.516165] RAX: ffffffffffffffda RBX: 0000000000003000 RCX: 00007f2940eb2fb3
[  310.516167] RDX: 0000000000003000 RSI: 0000556ba1e31000 RDI: 0000000000000003
[  310.516168] RBP: 0000000000000003 R08: 0000556ba1e31000 R09: 00007f2940e9bbe0
[  310.516170] R10: 0000556b9fedbf59 R11: 0000000000000246 R12: 0000000000000024
[  310.516172] R13: 00000000000cf000 R14: 0000556ba1e31000 R15: 0000000000000000
[  310.516174]  </TASK>
[  310.516178] CR2: 0000000000000000
[  310.516181] ---[ end trace 0000000000000000 ]---

This is then causing fsstress to wedge:

# ps -ax -o pid,user,wchan:20,args --sort pid
    PID USER     WCHAN                COMMAND
	...
  12860 root     do_wait              /bin/bash /root/xfstests/tests/generic/475
  13086 root     rescuer_thread       [kdmflush/253:7]
  15593 root     rescuer_thread       [ext4-rsv-conver]
  15598 root     jbd2_log_wait_commit ./ltp/fsstress -d /xt-vdc -n 999999 -p 4
  15600 root     ext4_release_file    [fsstress]
  15601 root     exit_aio             [fsstress]

So at this point, I'm going to drop this entire patch series from the
dev tree, since this *does* seem to be some kind of regression
triggered by the first patch in the patch series.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-14  5:42         ` Theodore Ts'o
@ 2023-06-14 13:25           ` Zhang Yi
  2023-06-14 20:37             ` Theodore Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2023-06-14 13:25 UTC (permalink / raw)
  To: Theodore Ts'o, Zhihao Cheng
  Cc: linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

On 2023/6/14 13:42, Theodore Ts'o wrote:
> OK, some more updates.  First of all, the e2fsck hang in the ext4/adv
> case is an inline_data bug in e2fsck/pass2.c:check_dir_block(); the
> code is clearly buggy, and I'll be sending out a fix in the next day
> or two.
> 
> I still don't understand why this patch series is changing the kernel
> behaviour enough to change the resulting file system in such a way as
> to unmask this bug.  The bug is triggered by file system corruption,
> so the question is whether this patch series is somehow causing the
> file system to be more corrupted than it otherwise would be.  I'm not
> sure.
> 
> However, the ext4/ext3 hang *is* a real hang in the kernel space, and
> generic/475 is not completing because the kernel seems to have ended
> up deadlocking somehow.  With just the first patch in this patch
> series ("jbd2: recheck chechpointing non-dirty buffer") we're getting
> a kernel NULL pointer derefence:
> 
> [  310.447568] EXT4-fs error (device dm-7): ext4_check_bdev_write_error:223: comm fsstress: Error while async write back metadata
> [  310.458038] EXT4-fs error (device dm-7): __ext4_get_inode_loc_noinmem:4467: inode #99400: block 393286: comm fsstress: unable to read itable block
> [  310.458421] JBD2: IO error reading journal superblock
> [  310.484755] EXT4-fs warning (device dm-7): ext4_end_bio:343: I/O error 10 writing to inode 36066 starting block 19083)
> [  310.490956] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  310.490959] #PF: supervisor write access in kernel mode
> [  310.490961] #PF: error_code(0x0002) - not-present page
> [  310.490963] PGD 0 P4D 0 
> [  310.490966] Oops: 0002 [#1] PREEMPT SMP PTI
> [  310.490970] CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
> [  310.490974] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
> [  310.490976] RIP: 0010:jbd2_journal_set_features+0x13d/0x430
> [  310.490985] Code: 0f 94 c0 44 20 e8 0f 85 e0 00 00 00 be 94 01 00 00 48 c7 c7 a1 33 59 b4 48 89 0c 24 4c 8b 7d 38 e8 a8 dc c5 ff 2e 2e 2e 31 c0 <f0> 49 0f ba 2f 02 48 8b 0c 24 0f 82 24 02 00 00 45 84 ed 8b 41 28
> [  310.490988] RSP: 0018:ffffb9b441043b30 EFLAGS: 00010246
> [  310.490990] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8edb447b8100
> [  310.490993] RDX: 0000000000000000 RSI: 0000000000000194 RDI: ffffffffb45933a1
> [  310.490994] RBP: ffff8edb45a62800 R08: ffffffffb460d6c0 R09: 0000000000000000
> [  310.490996] R10: 204f49203a324442 R11: 4f49203a3244424a R12: 0000000000000000
> [  310.490997] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [  310.490999] FS:  00007f2940cca740(0000) GS:ffff8edc19500000(0000) knlGS:0000000000000000
> [  310.491005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  310.491007] CR2: 0000000000000000 CR3: 000000012543e003 CR4: 00000000003706e0
> [  310.491009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  310.491011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  310.491012] Call Trace:
> [  310.491016]  <TASK>
> [  310.491019]  ? __die+0x23/0x60
> [  310.491025]  ? page_fault_oops+0xa4/0x170
> [  310.491029]  ? exc_page_fault+0x67/0x170
> [  310.491032]  ? asm_exc_page_fault+0x26/0x30
> [  310.491039]  ? jbd2_journal_set_features+0x13d/0x430
> [  310.491043]  jbd2_journal_revoke+0x47/0x1e0
> [  310.491046]  __ext4_forget+0xc3/0x1b0
> [  310.491051]  ext4_free_blocks+0x214/0x2f0
> [  310.491056]  ext4_free_branches+0xeb/0x270
> [  310.491061]  ext4_ind_truncate+0x2bf/0x320
> [  310.491065]  ext4_truncate+0x1e4/0x490
> [  310.491069]  ext4_handle_inode_extension+0x1bd/0x2a0
> [  310.491073]  ? iomap_dio_complete+0xaf/0x1d0
> [  310.511141] ------------[ cut here ]------------
> [  310.516121]  ext4_dio_write_iter+0x346/0x3e0
> [  310.516132]  ? __handle_mm_fault+0x171/0x200
> [  310.516135]  vfs_write+0x21a/0x3e0
> [  310.516140]  ksys_write+0x6f/0xf0
> [  310.516142]  do_syscall_64+0x3b/0x90
> [  310.516147]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  310.516154] RIP: 0033:0x7f2940eb2fb3
> [  310.516158] Code: 75 05 48 83 c4 58 c3 e8 cb 41 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
> [  310.516161] RSP: 002b:00007ffe9a322cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  310.516165] RAX: ffffffffffffffda RBX: 0000000000003000 RCX: 00007f2940eb2fb3
> [  310.516167] RDX: 0000000000003000 RSI: 0000556ba1e31000 RDI: 0000000000000003
> [  310.516168] RBP: 0000000000000003 R08: 0000556ba1e31000 R09: 00007f2940e9bbe0
> [  310.516170] R10: 0000556b9fedbf59 R11: 0000000000000246 R12: 0000000000000024
> [  310.516172] R13: 00000000000cf000 R14: 0000556ba1e31000 R15: 0000000000000000
> [  310.516174]  </TASK>
> [  310.516178] CR2: 0000000000000000
> [  310.516181] ---[ end trace 0000000000000000 ]---
> 

Sorry about the regression, I found that this issue is not introduced
by the first patch in this patch series ("jbd2: recheck chechpointing
non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
initialization from journal_get_superblock()") [1] on your dev branch.

The problem is the journal super block had been failed to write out
due to IO fault, it's uptodate bit was cleared by
end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
unfortunately, the read IO is also fail, so the error handling in
journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
to above NULL pointer dereference issue.

I think the fix could be just move buffer_verified(bh) in front of
bh_read(). I can send out the fix after tests.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=d9eafe0afafaa519953735498c2a065d223c519b

Thanks,
Yi.

> This is then causing fsstress to wedge:
> 
> # ps -ax -o pid,user,wchan:20,args --sort pid
>     PID USER     WCHAN                COMMAND
> 	...
>   12860 root     do_wait              /bin/bash /root/xfstests/tests/generic/475
>   13086 root     rescuer_thread       [kdmflush/253:7]
>   15593 root     rescuer_thread       [ext4-rsv-conver]
>   15598 root     jbd2_log_wait_commit ./ltp/fsstress -d /xt-vdc -n 999999 -p 4
>   15600 root     ext4_release_file    [fsstress]
>   15601 root     exit_aio             [fsstress]
> 
> So at this point, I'm going to drop this entire patch series from the
> dev tree, since this *does* seem to be some kind of regression
> triggered by the first patch in the patch series.
> 
> Regards,
> 
> 					- Ted
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-14 13:25           ` Zhang Yi
@ 2023-06-14 20:37             ` Theodore Ts'o
  2023-06-15  3:56               ` Zhang Yi
  2023-06-26  7:36               ` Zhang Yi
  0 siblings, 2 replies; 17+ messages in thread
From: Theodore Ts'o @ 2023-06-14 20:37 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhihao Cheng, linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

[-- Attachment #1: Type: text/plain, Size: 1943 bytes --]

On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
> 
> Sorry about the regression, I found that this issue is not introduced
> by the first patch in this patch series ("jbd2: recheck chechpointing
> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
> initialization from journal_get_superblock()") [1] on your dev branch.
> 
> The problem is the journal super block had been failed to write out
> due to IO fault, it's uptodate bit was cleared by
> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
> unfortunately, the read IO is also fail, so the error handling in
> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
> to above NULL pointer dereference issue.

Thanks for looking into this.  What I believe you are saying is that
the root cause is that earlier patch, but it is still something about
the patch "jbd2: recheck chechpointing non-dirty buffer" which is
changing the timing enough that we're hitting this buffer (because
without the "recheck checkpointing" patch, I'm not seeing the NULL
pointer dereference.

As far as the e2fsck bug that was causing it to hang in the ext4/adv
test scenario, the patch was a simple one, and I have also checked in
a test case which was a reliable reproducer of the problem.  (See
attached for the patches and more detail.)

It is really interesting that "recheck checkpointing" patch is making
enough of a difference that it is unmasking these bugs.  If you could
take a look at these changes and perhaps think about how this patch
series could be changing the nature of the corruption (specifically,
how symlink inodes referenced from inline directories could be
corupted with "rechecking checkpointing", thus unmasking the
e2fsprogs, I'd really appreciate it.

Thanks,

					- Ted


[-- Attachment #2: 0001-e2fsck-fix-handling-of-a-invalid-symlink-in-an-inlin.patch --]
[-- Type: text/x-diff, Size: 1608 bytes --]

From 8798bbb81687103b0c0f56a42b096884c6032101 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Wed, 14 Jun 2023 14:44:19 -0400
Subject: [PATCH 1/2] e2fsck: fix handling of a invalid symlink in an
 inline_data directory

If there is an inline directory that contains a directory entry to an
invalid symlink, and that invalid symlink is the portion of the inline
directory stored in an xattr portion of the inode, this can result in
a buffer overrun.

When check_dir_block() is handling the in-xattr portion of the inline
directory, it sets the buf pointer to the beginning of that part of
the inline directory.  This results in the scratch buffer passed to
e2fsck_process_bad_inode() to incorrect, resulting in a buffer overrun
if e2fsck_pass1_check_symlink() needs to read the symlink target (when
the symlink is too long to fit in the i_blocks[] space).

This commit fixes this by using the original cd->buf instead of buf,
since it can get modified when handling inline directories.

Fixes: 0ac4b3973f31 ("e2fsck: inspect inline dir data as two directory blocks")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 e2fsck/pass2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
index 47f9206f..42f3e5ef 100644
--- a/e2fsck/pass2.c
+++ b/e2fsck/pass2.c
@@ -1523,7 +1523,7 @@ skip_checksum:
 					     dirent->inode)) {
 			if (e2fsck_process_bad_inode(ctx, ino,
 						     dirent->inode,
-						     buf + fs->blocksize)) {
+						     cd->buf + fs->blocksize)) {
 				dirent->inode = 0;
 				dir_modified++;
 				goto next;
-- 
2.31.0


[-- Attachment #3: 0002-tests-add-test-for-handling-an-invalid-symlink-in-an.patch --]
[-- Type: text/x-diff, Size: 3922 bytes --]

From 541ce8f2bb6f91834b5d5b7c98bd4de8998dccc5 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Wed, 14 Jun 2023 15:15:48 -0400
Subject: [PATCH 2/2] tests: add test for handling an invalid symlink in an
 inline directory

Add a test for the commit "e2fsck: fix handling of a invalid symlink
in an inline_data directory"

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 tests/f_inlinedir_bad_symlink/expect.1 |  12 ++++++++++++
 tests/f_inlinedir_bad_symlink/expect.2 |   7 +++++++
 tests/f_inlinedir_bad_symlink/image.gz | Bin 0 -> 1797 bytes
 tests/f_inlinedir_bad_symlink/name     |   1 +
 4 files changed, 20 insertions(+)
 create mode 100644 tests/f_inlinedir_bad_symlink/expect.1
 create mode 100644 tests/f_inlinedir_bad_symlink/expect.2
 create mode 100644 tests/f_inlinedir_bad_symlink/image.gz
 create mode 100644 tests/f_inlinedir_bad_symlink/name

diff --git a/tests/f_inlinedir_bad_symlink/expect.1 b/tests/f_inlinedir_bad_symlink/expect.1
new file mode 100644
index 00000000..e1d0e879
--- /dev/null
+++ b/tests/f_inlinedir_bad_symlink/expect.1
@@ -0,0 +1,12 @@
+Pass 1: Checking inodes, blocks, and sizes
+Pass 2: Checking directory structure
+Symlink /a/7 (inode #19) is invalid.
+Clear? yes
+
+Pass 3: Checking directory connectivity
+Pass 4: Checking reference counts
+Pass 5: Checking group summary information
+
+test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
+test_filesys: 19/112 files (0.0% non-contiguous), 16/200 blocks
+Exit status is 1
diff --git a/tests/f_inlinedir_bad_symlink/expect.2 b/tests/f_inlinedir_bad_symlink/expect.2
new file mode 100644
index 00000000..b881d657
--- /dev/null
+++ b/tests/f_inlinedir_bad_symlink/expect.2
@@ -0,0 +1,7 @@
+Pass 1: Checking inodes, blocks, and sizes
+Pass 2: Checking directory structure
+Pass 3: Checking directory connectivity
+Pass 4: Checking reference counts
+Pass 5: Checking group summary information
+test_filesys: 19/112 files (0.0% non-contiguous), 16/200 blocks
+Exit status is 0
diff --git a/tests/f_inlinedir_bad_symlink/image.gz b/tests/f_inlinedir_bad_symlink/image.gz
new file mode 100644
index 0000000000000000000000000000000000000000..c5bd71a3b5d3e685ce1ca34650e9325aabf2f498
GIT binary patch
literal 1797
zcmeHFZ!p^j7>>HF^p-;%>Xe*+jV3K){xsaU$rPp24pl16*2p3>RYJ_4h^f{o>QB#A
zS(9aDRvS4idKUJps<@VkrCUOXUoE<bG@>FwB){L-=YHt7ed+t{dG3AQ=e_5BQ8iXp
zBHNo8`z)!nFDYb%Mn=m_CLczuDKFQk(zgd<tXpGHm3diYU!Tv_*J3lzo%z05_PhA*
zNmb{u;3yK;%Q>q8bx@6`dL2Dw8{_BUfOO~Di1%}w*1TKdDmNHxO?s;$?EyVo0DxPd
z5+{-yB48bZL)RiLe3Jb-a((K$0u3GhAufIs9)XNcO~WO=fl5ioS-bV^@z2oivE#S+
zLCOO#_NnZX*+{0Wv=>u(<ECxd7Wu_g0($a1Kpoj3rkwtm->xpMEy-1_XA{)JG_0sD
zHHapQ#%ejtjYoy{w?D)sQA7nj_XV}5SWO-Z2i1$lGR9$K9!8HNQGouDbBJ=Wy>e?*
zd`KP+b6lR`qjt7#oaQ6xkGG9+W;vVBlgtz&5QyoJB`oGKXDqz^ndL4tN104Os290l
zw$P#rpPEV_jBnzyBa>&F$h>0!ZL!ZGj?Og;`l{YXlXb(ieY+B7WU(LoTw=ywMCDiS
zz{|^D%vtZ!Mx&WEv1o1c1ajWo-ZRS4!s8(*Y94H}=L175sDeTl9WRBhN7d{wL~ifx
z(r;PI;QDr(Z`z)_5v47EOVyj&uHyc?#vh5ePKmxuj_z!?!AG_fadCxx-JYTB2UeD{
zIScn#)K6)*^OtgRe>P$0@K9*w*~QTyboiB`03li-nbbVhfO9g(W&<u|he?%2sN<<p
zz}UY)tP=jPvG{1oLkfEdAi`jm??mjKmq*(a`ka}O9v;4<59VZIH40Wn990U_61BY1
zZ%)9@#LG~M<H-yi=S&8axDby2HB@(GOb1U_v%=wtGJ!+~2!CN7zXuP08^wht8%Acu
zdTLUY*WHGJkPLH}m~mAC^ph~oSz?HAFR#{3@E!mB>KL*YG7Z^%gw@u(CXH6CoPcZT
zBUAd6B*v@i{r!#erk6TWSl~pMKdp*v7CdKs9SsD?hZcplbgx*m_X^nPdaawf5Sd=e
zcbmRglB{mlYF+uB5kA4cJ@oFGD+tSG%M4`qOVEtr&)1;6p2at7RAWN0yH1}2E`Vy@
zJfM|aQq;3+^`aRGsOAi<KH0^BE4TD4`UK2{bi2dz3(bRk<pRcNOmR{JOp<=+uLLKL
zf9h!oQX&*h@3EB=VB@gjq@zoY%vDW%`1h8IaN&TRZhW9~MFiaG!58&Pfh8gc|7fbm
zg-4VPEXOOQLW3Gyo8qzjS+6!T%)L{dlr}nDJYUss;H+Nom2#K3MPtJ3;4vcw2?l&+
zY^#r#5#Sv*rKu&tP-5IdwYn8vUpDCA#X8HHI!@>zHH)NEbl;~q>@D`Dk9x<wOW=P=
Rp!}f0Nz*WBg(&|3@h^7vKm7mz

literal 0
HcmV?d00001

diff --git a/tests/f_inlinedir_bad_symlink/name b/tests/f_inlinedir_bad_symlink/name
new file mode 100644
index 00000000..f7f7f0d6
--- /dev/null
+++ b/tests/f_inlinedir_bad_symlink/name
@@ -0,0 +1 @@
+bad symlink in an inline directory
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-14 20:37             ` Theodore Ts'o
@ 2023-06-15  3:56               ` Zhang Yi
  2023-06-26  7:36               ` Zhang Yi
  1 sibling, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-15  3:56 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Zhihao Cheng, linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
> 
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.

I have send out a separate patch names "jbd2: skip reading super block
if it has been verified" to fix above NULL pointer dereference issue, I
have been runing ext3 generic/475 about 12hours and have not reproduced
the problem again (I will also do more tests later). Please take a look
at it.

> 
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
> 
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
> 

Sure, we will take a look at it for details.

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
  2023-06-14 20:37             ` Theodore Ts'o
  2023-06-15  3:56               ` Zhang Yi
@ 2023-06-26  7:36               ` Zhang Yi
  1 sibling, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2023-06-26  7:36 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Zhihao Cheng, linux-ext4, adilger.kernel, jack, yi.zhang, yukuai3

On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
> 
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.
> 
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
> 
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
> 

Hello Ted.

I have found the root cause of which trigger the e2fsck bug, it's a
real kernel bug which was introduced in 5b849b5f96b4 ("jbd2: fast
commit recovery path") when merging fast commit feature.

The problem is that when fast commit is enabled, kernel doesn't replay
the journal completely at mount time (there a bug in do_one_pass(),
see below for details) if the unrecovered transactions loop back to
the head of the normal journal area. If the missing transactions
contain a symlink block revoke record and a reuse record of the same
block, and the reuse record have been write back to the disk before
it's last umount, it could trigger the "Symlink xxx is invalid" after
the incomplete journal replay. Meanwhile it the symlink belongs to a
inline directory, it will trigger the e2fsck bug.

For example, we have a not cleaned image below.

 | normal journal area                             | fast commit area |
 +-------------------------------------------------+------------------+
 | tX+1(rere half)|tX+2|...| tX | tX+1(front half) |       ....       |
 +-------------------------------------------------+------------------+
                       /        /                  /                  /
             (real head)  s_start    journal->j_last journal->j_fc_last

Transaction X(checkpointed):
 - Create symlink 'foo' (use block A, contain 500 length of link name)
   in inline directory 'dir';
Transaction X+1(uncheckpoint):
 - Remove symlink 'foo' (block A is also freed);
Transaction X+2(uncheckpoint):
 - Create symlink 'bar' (reuse block A, contain 400 length of link
   name).

The above transactions are recorded to the journal, block A is also
successfully written back by the background write-back process.

If fast_commit feature is enabled, the journal->j_last point to the
first unused block behind the normal journal area instead of the whole
log area, and the journal->j_fc_last point to the first unused block
behind the fast_commit journal area. While doing journal recovery,
do_one_pass(PASS_SCAN) should first scan tX+1 in the normal journal
area and turn around to the first block once it meet journal->j_last,
but the wrap() macro misuse the journal->j_fc_last, so it could not
read tX+1's commit block, the recovery quit early mistakenly and
missing tX+1 and tX+2. So, after we mount the filesystem, we could
leftover an invalid symlink 'foo' in the inline directory and trigger
issue.

I can reproduce this issue either with or without this patch series
("jbd2: recheck chechpointing non-dirty buffer"), sometimes it takes
longer, sometimes it takes less. It's easy to reproduce the "Symlink
xxx is invalid" issue, but it's a little hard to trigger the e2fsck
bug (which means the invalid symlink should in a inline dir).
Especially I could 100% reproduce the fast_commit recovery bug when
running generic/475. So I couldn't find any relations between this
issue and this patch series.

I've send a patch to fix the fast commit bug separately[1]. I cannot
reproduce this issue again with this patch, please take a look at
that.

[1] https://lore.kernel.org/linux-ext4/20230626073322.3956567-1-yi.zhang@huaweicloud.com/T/#u

Thanks,
Yi.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues
  2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
                   ` (5 preceding siblings ...)
  2023-06-06 13:59 ` [PATCH v3 6/6] jbd2: remove __journal_try_to_free_buffer() Zhang Yi
@ 2023-07-12 18:29 ` Theodore Ts'o
  6 siblings, 0 replies; 17+ messages in thread
From: Theodore Ts'o @ 2023-07-12 18:29 UTC (permalink / raw)
  To: linux-ext4, Zhang Yi
  Cc: Theodore Ts'o, adilger.kernel, jack, yi.zhang, yukuai3,
	chengzhihao1


On Tue, 06 Jun 2023 21:59:22 +0800, Zhang Yi wrote:
> v2->v3:
>  - Init released parameter in journal_shrink_one_cp_list() instead of
>    __jbd2_journal_clean_checkpoint_list() in patch 3.
>  - Fix a comment in patch 5.
> v1->v2:
>  - Drop the last patch in [1].
>  - Merge journal_clean_one_cp_list() into journal_shrink_one_cp_list().
>  - Fix the race issues through trying to hold buffer lock and check
>    dirty state under the lock.
>  - Append a cleanup patch, remove __journal_try_to_free_buffer().
> 
> [...]

Applied, thanks!

[1/6] jbd2: recheck chechpointing non-dirty buffer
      commit: c2d6fd9d6f35079f1669f0100f05b46708c74b7f
[2/6] jbd2: remove t_checkpoint_io_list
      commit: be22255360f80d3af789daad00025171a65424a5
[3/6] jbd2: remove journal_clean_one_cp_list()
      commit: b98dba273a0e47dbfade89c9af73c5b012a4eabb
[4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
      commit: e34c8dd238d0c9368b746480f313055f5bab5040
[5/6] jbd2: fix a race when checking checkpoint buffer busy
      commit: 46f881b5b1758dc4a35fba4a643c10717d0cf427
[6/6] jbd2: remove __journal_try_to_free_buffer()
      commit: 3c55097c553c49deab60ac62c83ef17565004a97

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-07-12 18:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-06 13:59 [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Zhang Yi
2023-06-06 13:59 ` [PATCH v3 1/6] jbd2: recheck chechpointing non-dirty buffer Zhang Yi
2023-06-06 13:59 ` [PATCH v3 2/6] jbd2: remove t_checkpoint_io_list Zhang Yi
2023-06-06 13:59 ` [PATCH v3 3/6] jbd2: remove journal_clean_one_cp_list() Zhang Yi
2023-06-07  8:30   ` Jan Kara
2023-06-06 13:59 ` [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint Zhang Yi
2023-06-13  4:31   ` Theodore Ts'o
2023-06-13  8:13     ` Zhihao Cheng
2023-06-13 17:27       ` Theodore Ts'o
2023-06-14  5:42         ` Theodore Ts'o
2023-06-14 13:25           ` Zhang Yi
2023-06-14 20:37             ` Theodore Ts'o
2023-06-15  3:56               ` Zhang Yi
2023-06-26  7:36               ` Zhang Yi
2023-06-06 13:59 ` [PATCH v3 5/6] jbd2: fix a race when checking checkpoint buffer busy Zhang Yi
2023-06-06 13:59 ` [PATCH v3 6/6] jbd2: remove __journal_try_to_free_buffer() Zhang Yi
2023-07-12 18:29 ` [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).