[PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
@ 2024-07-11  8:35 Luis Henriques (SUSE)
  2024-07-11 13:32 ` wangjianjian (C)
  2024-07-16 10:24 ` [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Luis Henriques (SUSE) @ 2024-07-11  8:35 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Jan Kara, Harshad Shirwadkar
  Cc: linux-ext4, linux-kernel, Luis Henriques (SUSE)

When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again.  However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING.  And this is not being done in function
ext4_fc_track_template().

This patch fixes the issue by re-enqueuing an inode into the STAGING queue
during the fast commit clean-up callback if it has a tid (i_sync_tid)
greater than the one being handled.  The STAGING queue will then be spliced
back into MAIN.

This bug was found using fstest generic/047.  This test creates several 32k
bytes files, sync'ing each of them after it's creation, and then shutting
down the filesystem.  Some data may be loss in this operation; for example a
file may have it's size truncated to zero.

Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
Hi!

v4 of this patch enqueues the inode into STAGING *only* if the current tid
is non-zero.  It will be zero when doing an fc commit, and this would mean
to always re-enqueue the inode.  This fixes the regressions caught by Ted
in v3 with fstests generic/472 generic/496 generic/643.

Also, since 2nd patch of v3 has already been merged, I've rebased this patch
to be applied on top of it.

 fs/ext4/fast_commit.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 3926a05eceee..facbc8dbbaa2 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, iter->i_sync_tid))
 			ext4_fc_reset_inode(&iter->vfs_inode);
+		} else if (tid) {
+			/*
+			 * If the tid is valid (i.e. non-zero) re-enqueue the
+			 * inode into STAGING, which will then be splice back
+			 * into MAIN
+			 */
+			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
+				      &sbi->s_fc_q[FC_Q_STAGING]);
+		}
+
 		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
 		smp_mb();
 #if (BITS_PER_LONG < 64)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11  8:35 [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Luis Henriques (SUSE)
@ 2024-07-11 13:32 ` wangjianjian (C)
  2024-07-11 15:16   ` Luis Henriques
  2024-07-16 10:24 ` [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Jan Kara
  1 sibling, 1 reply; 14+ messages in thread
From: wangjianjian (C) @ 2024-07-11 13:32 UTC (permalink / raw)
  To: Luis Henriques (SUSE), Theodore Ts'o, Andreas Dilger,
	Jan Kara, Harshad Shirwadkar
  Cc: linux-ext4, linux-kernel

On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
> When a full journal commit is on-going, any fast commit has to be enqueued
> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
> is done only once, i.e. if an inode is already queued in a previous fast
> commit entry it won't be enqueued again.  However, if a full commit starts
> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
> be done into FC_Q_STAGING.  And this is not being done in function
> ext4_fc_track_template().
> 
> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
> during the fast commit clean-up callback if it has a tid (i_sync_tid)
> greater than the one being handled.  The STAGING queue will then be spliced
> back into MAIN.
> 
> This bug was found using fstest generic/047.  This test creates several 32k
> bytes files, sync'ing each of them after it's creation, and then shutting
> down the filesystem.  Some data may be loss in this operation; for example a
> file may have it's size truncated to zero.
> 
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
> ---
> Hi!
> 
> v4 of this patch enqueues the inode into STAGING *only* if the current tid
> is non-zero.  It will be zero when doing an fc commit, and this would mean
> to always re-enqueue the inode.  This fixes the regressions caught by Ted
> in v3 with fstests generic/472 generic/496 generic/643.
> 
> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
> to be applied on top of it.
> 
>   fs/ext4/fast_commit.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 3926a05eceee..facbc8dbbaa2 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>   				       EXT4_STATE_FC_COMMITTING);
>   		if (tid_geq(tid, iter->i_sync_tid))
>   			ext4_fc_reset_inode(&iter->vfs_inode);
> +		} else if (tid) {
> +			/*
> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
one quick question about tid, if one disk is using long time and its tid 
  get wrapped to 0, is it a valid seq? I don't find code handling this 
situation.
> +			 * inode into STAGING, which will then be splice back
> +			 * into MAIN
> +			 */
> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
> +				      &sbi->s_fc_q[FC_Q_STAGING]);
> +		}
> +
>   		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
>   		smp_mb();
>   #if (BITS_PER_LONG < 64)
> 
-- 
Regards


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11 13:32 ` wangjianjian (C)
@ 2024-07-11 15:16   ` Luis Henriques
  2024-07-11 16:16     ` Wang Jianjian
  0 siblings, 1 reply; 14+ messages in thread
From: Luis Henriques @ 2024-07-11 15:16 UTC (permalink / raw)
  To: wangjianjian (C)
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel

On Thu, Jul 11 2024, wangjianjian (C) wrote:

> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>> When a full journal commit is on-going, any fast commit has to be enqueued
>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>> is done only once, i.e. if an inode is already queued in a previous fast
>> commit entry it won't be enqueued again.  However, if a full commit starts
>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>> be done into FC_Q_STAGING.  And this is not being done in function
>> ext4_fc_track_template().
>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>> greater than the one being handled.  The STAGING queue will then be spliced
>> back into MAIN.
>> This bug was found using fstest generic/047.  This test creates several 32k
>> bytes files, sync'ing each of them after it's creation, and then shutting
>> down the filesystem.  Some data may be loss in this operation; for example a
>> file may have it's size truncated to zero.
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>> ---
>> Hi!
>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>> is non-zero.  It will be zero when doing an fc commit, and this would mean
>> to always re-enqueue the inode.  This fixes the regressions caught by Ted
>> in v3 with fstests generic/472 generic/496 generic/643.
>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>> to be applied on top of it.
>>   fs/ext4/fast_commit.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>> index 3926a05eceee..facbc8dbbaa2 100644
>> --- a/fs/ext4/fast_commit.c
>> +++ b/fs/ext4/fast_commit.c
>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>   				       EXT4_STATE_FC_COMMITTING);
>>   		if (tid_geq(tid, iter->i_sync_tid))
>>   			ext4_fc_reset_inode(&iter->vfs_inode);
>> +		} else if (tid) {
>> +			/*
>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
> one quick question about tid, if one disk is using long time and its tid   get
> wrapped to 0, is it a valid seq? I don't find code handling this situation.

Hmm... OK.  So, to answer to your question, the 'tid' is expected to wrap.
That's why we use:

	if (tid_geq(tid, iter->i_sync_tid))

instead of:

	if (tid >= iter->i_sync_tid)

(The second patch in v3 actually fixed a few places where the tid_*()
helpers weren't being used.)

But your question shows me that my patch is wrong as '0' may actually be a
valid 'tid' value.

Cheers,
-- 
Luís

>> +			 * inode into STAGING, which will then be splice back
>> +			 * into MAIN
>> +			 */
>> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
>> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>> +		}
>> +
>>   		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
>>   		smp_mb();
>>   #if (BITS_PER_LONG < 64)
>> 
> -- 
> Regards
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11 15:16   ` Luis Henriques
@ 2024-07-11 16:16     ` Wang Jianjian
  2024-07-11 19:28       ` Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Wang Jianjian @ 2024-07-11 16:16 UTC (permalink / raw)
  To: Luis Henriques, wangjianjian (C)
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel


On 2024/7/11 23:16, Luis Henriques wrote:
> On Thu, Jul 11 2024, wangjianjian (C) wrote:
>
>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>>> When a full journal commit is on-going, any fast commit has to be enqueued
>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>>> is done only once, i.e. if an inode is already queued in a previous fast
>>> commit entry it won't be enqueued again.  However, if a full commit starts
>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>>> be done into FC_Q_STAGING.  And this is not being done in function
>>> ext4_fc_track_template().
>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>>> greater than the one being handled.  The STAGING queue will then be spliced
>>> back into MAIN.
>>> This bug was found using fstest generic/047.  This test creates several 32k
>>> bytes files, sync'ing each of them after it's creation, and then shutting
>>> down the filesystem.  Some data may be loss in this operation; for example a
>>> file may have it's size truncated to zero.
>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>>> ---
>>> Hi!
>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>>> is non-zero.  It will be zero when doing an fc commit, and this would mean
>>> to always re-enqueue the inode.  This fixes the regressions caught by Ted
>>> in v3 with fstests generic/472 generic/496 generic/643.
>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>>> to be applied on top of it.
>>>    fs/ext4/fast_commit.c | 10 ++++++++++
>>>    1 file changed, 10 insertions(+)
>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>> index 3926a05eceee..facbc8dbbaa2 100644
>>> --- a/fs/ext4/fast_commit.c
>>> +++ b/fs/ext4/fast_commit.c
>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>>    				       EXT4_STATE_FC_COMMITTING);
>>>    		if (tid_geq(tid, iter->i_sync_tid))
>>>    			ext4_fc_reset_inode(&iter->vfs_inode);
>>> +		} else if (tid) {
>>> +			/*
>>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
>> one quick question about tid, if one disk is using long time and its tid   get
>> wrapped to 0, is it a valid seq? I don't find code handling this situation.
> Hmm... OK.  So, to answer to your question, the 'tid' is expected to wrap.
> That's why we use:
>
> 	if (tid_geq(tid, iter->i_sync_tid))
Yes, I know this.
>
> instead of:
>
> 	if (tid >= iter->i_sync_tid)
>
> (The second patch in v3 actually fixed a few places where the tid_*()
> helpers weren't being used.)
>
> But your question shows me that my patch is wrong as '0' may actually be a
> valid 'tid' value.

Actually my question is,  there are some place use '0' to check if a 
transaction is valid, e.g.

In ext4_wait_for_tail_page_commit()

5218         while (1) {
5219                 struct folio *folio = 
filemap_lock_folio(inode->i_mapping,
5220                                       inode->i_size >> PAGE_SHIFT);
5221                 if (IS_ERR(folio))
5222                         return;
5223                 ret = __ext4_journalled_invalidate_folio(folio, offset,
5224 folio_size(folio) - offset);
5225                 folio_unlock(folio);
5226                 folio_put(folio);
5227                 if (ret != -EBUSY)
5228                         return;
5229                 commit_tid = 0;
5230                 read_lock(&journal->j_state_lock);
5231                 if (journal->j_committing_transaction)
5232                         commit_tid = 
journal->j_committing_transaction->t_tid;
5233                 read_unlock(&journal->j_state_lock);
5234                 if (commit_tid)
5235                         jbd2_log_wait_commit(journal, commit_tid);
5236         }
5237  We only wait commit if tid is not zero.

And in __jbd2_log_wait_for_space()

79                 if (space_left < nblocks) {
  80                         int chkpt = 
journal->j_checkpoint_transactions != NULL;
  81                         tid_t tid = 0;
  82
  83                         if (journal->j_committing_transaction)
  84                                 tid = 
journal->j_committing_transaction->t_tid;
  85 spin_unlock(&journal->j_list_lock);
  86 write_unlock(&journal->j_state_lock);
  87                         if (chkpt) {
  88 jbd2_log_do_checkpoint(journal);
  89                         } else if 
(jbd2_cleanup_journal_tail(journal) == 0) {
  90                                 /* We were able to recover space; 
yay! */
  91                                 ;
  92                         } else if (tid) {
  93                                 /*
  94                                  * 
jbd2_journal_commit_transaction() may want
  95                                  * to take the checkpoint_mutex if 
JBD2_FLUSHED
  96                                  * is set.  So we need to 
temporarily drop it.
  97                                  */
  98 mutex_unlock(&journal->j_checkpoint_mutex);
  99                                 jbd2_log_wait_commit(journal, tid);
100 write_lock(&journal->j_state_lock);
101                                 continue;
We also only wait commit if tid is not zero.

Does it mean all these have bugs if '0' is a valid 'tid' ?

But on the other hand, if we don't consider sync and fsync, and default 
commit interval is 5s,

time of tid wrap to 0 is nearly 680 years. However, we can run 
sync/fsync to make tid to increase

more quickly in real world ?


> Cheers,


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11 16:16     ` Wang Jianjian
@ 2024-07-11 19:28       ` Andreas Dilger
  2024-07-12  0:51         ` wangjianjian (C)
  2024-07-12  9:15         ` Luis Henriques
  0 siblings, 2 replies; 14+ messages in thread
From: Andreas Dilger @ 2024-07-11 19:28 UTC (permalink / raw)
  To: Wang Jianjian
  Cc: Luis Henriques, wangjianjian (C), Theodore Ts'o, Jan Kara,
	Harshad Shirwadkar, linux-ext4, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6391 bytes --]

On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@foxmail.com> wrote:
> On 2024/7/11 23:16, Luis Henriques wrote:
>> On Thu, Jul 11 2024, wangjianjian (C) wrote:
>> 
>>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>>>> When a full journal commit is on-going, any fast commit has to be enqueued
>>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>>>> is done only once, i.e. if an inode is already queued in a previous fast
>>>> commit entry it won't be enqueued again.  However, if a full commit starts
>>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>>>> be done into FC_Q_STAGING.  And this is not being done in function
>>>> ext4_fc_track_template().
>>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>>>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>>>> greater than the one being handled.  The STAGING queue will then be spliced
>>>> back into MAIN.
>>>> This bug was found using fstest generic/047.  This test creates several 32k
>>>> bytes files, sync'ing each of them after it's creation, and then shutting
>>>> down the filesystem.  Some data may be loss in this operation; for example a
>>>> file may have it's size truncated to zero.
>>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>>>> ---
>>>> Hi!
>>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>>>> is non-zero.  It will be zero when doing an fc commit, and this would mean
>>>> to always re-enqueue the inode.  This fixes the regressions caught by Ted
>>>> in v3 with fstests generic/472 generic/496 generic/643.
>>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>>>> to be applied on top of it.
>>>>   fs/ext4/fast_commit.c | 10 ++++++++++
>>>>   1 file changed, 10 insertions(+)
>>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>>> index 3926a05eceee..facbc8dbbaa2 100644
>>>> --- a/fs/ext4/fast_commit.c
>>>> +++ b/fs/ext4/fast_commit.c
>>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>>>   				       EXT4_STATE_FC_COMMITTING);
>>>>   		if (tid_geq(tid, iter->i_sync_tid))
>>>>   			ext4_fc_reset_inode(&iter->vfs_inode);
>>>> +		} else if (tid) {
>>>> +			/*
>>>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
>>> one quick question about tid, if one disk is using long time and its tid   get
>>> wrapped to 0, is it a valid seq? I don't find code handling this situation.
>> Hmm... OK.  So, to answer to your question, the 'tid' is expected to wrap.
>> That's why we use:
>> 
>> 	if (tid_geq(tid, iter->i_sync_tid))
> Yes, I know this.
>> 
>> instead of:
>> 
>> 	if (tid >= iter->i_sync_tid)
>> 
>> (The second patch in v3 actually fixed a few places where the tid_*()
>> helpers weren't being used.)
>> 
>> But your question shows me that my patch is wrong as '0' may actually be a
>> valid 'tid' value.
> 
> Actually my question is,  there are some place use '0' to check if a transaction is valid, e.g.
> 
> In ext4_wait_for_tail_page_commit()
> 
> 5218         while (1) {
> 5219                 struct folio *folio = filemap_lock_folio(inode->i_mapping,
> 5220                                       inode->i_size >> PAGE_SHIFT);
> 5221                 if (IS_ERR(folio))
> 5222                         return;
> 5223                 ret = __ext4_journalled_invalidate_folio(folio, offset,
> 5224 folio_size(folio) - offset);
> 5225                 folio_unlock(folio);
> 5226                 folio_put(folio);
> 5227                 if (ret != -EBUSY)
> 5228                         return;
> 5229                 commit_tid = 0;
> 5230                 read_lock(&journal->j_state_lock);
> 5231                 if (journal->j_committing_transaction)
> 5232                         commit_tid = journal->j_committing_transaction->t_tid;
> 5233                 read_unlock(&journal->j_state_lock);
> 5234                 if (commit_tid)
> 5235                         jbd2_log_wait_commit(journal, commit_tid);
> 5236         }
> 5237  We only wait commit if tid is not zero.
> 
> And in __jbd2_log_wait_for_space()
> 
> 79                 if (space_left < nblocks) {
>  80                         int chkpt = journal->j_checkpoint_transactions != NULL;
>  81                         tid_t tid = 0;
>  82
>  83                         if (journal->j_committing_transaction)
>  84                                 tid = journal->j_committing_transaction->t_tid;
>  85 spin_unlock(&journal->j_list_lock);
>  86 write_unlock(&journal->j_state_lock);
>  87                         if (chkpt) {
>  88 jbd2_log_do_checkpoint(journal);
>  89                         } else if (jbd2_cleanup_journal_tail(journal) == 0) {
>  90                                 /* We were able to recover space; yay! */
>  91                                 ;
>  92                         } else if (tid) {
>  93                                 /*
>  94                                  * jbd2_journal_commit_transaction() may want
>  95                                  * to take the checkpoint_mutex if JBD2_FLUSHED
>  96                                  * is set.  So we need to temporarily drop it.
>  97                                  */
>  98 mutex_unlock(&journal->j_checkpoint_mutex);
>  99                                 jbd2_log_wait_commit(journal, tid);
> 100 write_lock(&journal->j_state_lock);
> 101                                 continue;
> We also only wait commit if tid is not zero.
> 
> Does it mean all these have bugs if '0' is a valid 'tid' ?
> 
> But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s,
> 
> time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase
> 
> more quickly in real world ?

The simple solution is that "0" is not a valid sequence.  It looks like
this is a bug in jbd2_get_transaction() where it is incrementing the TID:

        transaction->t_tid = journal->j_transaction_sequence++;

it should add a check to handle the wrap-around:

        if (unlikely(transaction->t_tid == 0))
                transaction->t_tid = journal->j_transaction_sequence++;

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11 19:28       ` Andreas Dilger
@ 2024-07-12  0:51         ` wangjianjian (C)
  2024-07-12  9:15         ` Luis Henriques
  1 sibling, 0 replies; 14+ messages in thread
From: wangjianjian (C) @ 2024-07-12  0:51 UTC (permalink / raw)
  To: Andreas Dilger, Wang Jianjian
  Cc: Luis Henriques, Theodore Ts'o, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel

On 2024/7/12 3:28, Andreas Dilger wrote:
> On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@foxmail.com> wrote:
>> On 2024/7/11 23:16, Luis Henriques wrote:
>>> On Thu, Jul 11 2024, wangjianjian (C) wrote:
>>>
>>>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>>>>> When a full journal commit is on-going, any fast commit has to be enqueued
>>>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>>>>> is done only once, i.e. if an inode is already queued in a previous fast
>>>>> commit entry it won't be enqueued again.  However, if a full commit starts
>>>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>>>>> be done into FC_Q_STAGING.  And this is not being done in function
>>>>> ext4_fc_track_template().
>>>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>>>>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>>>>> greater than the one being handled.  The STAGING queue will then be spliced
>>>>> back into MAIN.
>>>>> This bug was found using fstest generic/047.  This test creates several 32k
>>>>> bytes files, sync'ing each of them after it's creation, and then shutting
>>>>> down the filesystem.  Some data may be loss in this operation; for example a
>>>>> file may have it's size truncated to zero.
>>>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>>>>> ---
>>>>> Hi!
>>>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>>>>> is non-zero.  It will be zero when doing an fc commit, and this would mean
>>>>> to always re-enqueue the inode.  This fixes the regressions caught by Ted
>>>>> in v3 with fstests generic/472 generic/496 generic/643.
>>>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>>>>> to be applied on top of it.
>>>>>    fs/ext4/fast_commit.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>>>> index 3926a05eceee..facbc8dbbaa2 100644
>>>>> --- a/fs/ext4/fast_commit.c
>>>>> +++ b/fs/ext4/fast_commit.c
>>>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>>>>    				       EXT4_STATE_FC_COMMITTING);
>>>>>    		if (tid_geq(tid, iter->i_sync_tid))
>>>>>    			ext4_fc_reset_inode(&iter->vfs_inode);
>>>>> +		} else if (tid) {
>>>>> +			/*
>>>>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
>>>> one quick question about tid, if one disk is using long time and its tid   get
>>>> wrapped to 0, is it a valid seq? I don't find code handling this situation.
>>> Hmm... OK.  So, to answer to your question, the 'tid' is expected to wrap.
>>> That's why we use:
>>>
>>> 	if (tid_geq(tid, iter->i_sync_tid))
>> Yes, I know this.
>>>
>>> instead of:
>>>
>>> 	if (tid >= iter->i_sync_tid)
>>>
>>> (The second patch in v3 actually fixed a few places where the tid_*()
>>> helpers weren't being used.)
>>>
>>> But your question shows me that my patch is wrong as '0' may actually be a
>>> valid 'tid' value.
>>
>> Actually my question is,  there are some place use '0' to check if a transaction is valid, e.g.
>>
>> In ext4_wait_for_tail_page_commit()
>>
>> 5218         while (1) {
>> 5219                 struct folio *folio = filemap_lock_folio(inode->i_mapping,
>> 5220                                       inode->i_size >> PAGE_SHIFT);
>> 5221                 if (IS_ERR(folio))
>> 5222                         return;
>> 5223                 ret = __ext4_journalled_invalidate_folio(folio, offset,
>> 5224 folio_size(folio) - offset);
>> 5225                 folio_unlock(folio);
>> 5226                 folio_put(folio);
>> 5227                 if (ret != -EBUSY)
>> 5228                         return;
>> 5229                 commit_tid = 0;
>> 5230                 read_lock(&journal->j_state_lock);
>> 5231                 if (journal->j_committing_transaction)
>> 5232                         commit_tid = journal->j_committing_transaction->t_tid;
>> 5233                 read_unlock(&journal->j_state_lock);
>> 5234                 if (commit_tid)
>> 5235                         jbd2_log_wait_commit(journal, commit_tid);
>> 5236         }
>> 5237  We only wait commit if tid is not zero.
>>
>> And in __jbd2_log_wait_for_space()
>>
>> 79                 if (space_left < nblocks) {
>>   80                         int chkpt = journal->j_checkpoint_transactions != NULL;
>>   81                         tid_t tid = 0;
>>   82
>>   83                         if (journal->j_committing_transaction)
>>   84                                 tid = journal->j_committing_transaction->t_tid;
>>   85 spin_unlock(&journal->j_list_lock);
>>   86 write_unlock(&journal->j_state_lock);
>>   87                         if (chkpt) {
>>   88 jbd2_log_do_checkpoint(journal);
>>   89                         } else if (jbd2_cleanup_journal_tail(journal) == 0) {
>>   90                                 /* We were able to recover space; yay! */
>>   91                                 ;
>>   92                         } else if (tid) {
>>   93                                 /*
>>   94                                  * jbd2_journal_commit_transaction() may want
>>   95                                  * to take the checkpoint_mutex if JBD2_FLUSHED
>>   96                                  * is set.  So we need to temporarily drop it.
>>   97                                  */
>>   98 mutex_unlock(&journal->j_checkpoint_mutex);
>>   99                                 jbd2_log_wait_commit(journal, tid);
>> 100 write_lock(&journal->j_state_lock);
>> 101                                 continue;
>> We also only wait commit if tid is not zero.
>>
>> Does it mean all these have bugs if '0' is a valid 'tid' ?
>>
>> But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s,
>>
>> time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase
>>
>> more quickly in real world ?
> 
> The simple solution is that "0" is not a valid sequence.  It looks like
> this is a bug in jbd2_get_transaction() where it is incrementing the TID:
> 
>          transaction->t_tid = journal->j_transaction_sequence++;
> 
> it should add a check to handle the wrap-around:
> 
>          if (unlikely(transaction->t_tid == 0))
>                  transaction->t_tid = journal->j_transaction_sequence++;
Yes, I had ever thought about this, just curious why nobody encounter 
problems here.
> 
> Cheers, Andreas
> 
> 
> 
> 
> 
-- 
Regards


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11 19:28       ` Andreas Dilger
  2024-07-12  0:51         ` wangjianjian (C)
@ 2024-07-12  9:15         ` Luis Henriques
  2024-07-12  9:53           ` [RFC PATCH] jbd2: make '0' an invalid transaction sequence Luis Henriques
  1 sibling, 1 reply; 14+ messages in thread
From: Luis Henriques @ 2024-07-12  9:15 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Wang Jianjian, wangjianjian (C), Theodore Ts'o, Jan Kara,
	Harshad Shirwadkar, linux-ext4, linux-kernel

On Thu, Jul 11 2024, Andreas Dilger wrote:

> On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@foxmail.com> wrote:
>> On 2024/7/11 23:16, Luis Henriques wrote:
>>> On Thu, Jul 11 2024, wangjianjian (C) wrote:
>>> 
>>>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>>>>> When a full journal commit is on-going, any fast commit has to be enqueued
>>>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>>>>> is done only once, i.e. if an inode is already queued in a previous fast
>>>>> commit entry it won't be enqueued again.  However, if a full commit starts
>>>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>>>>> be done into FC_Q_STAGING.  And this is not being done in function
>>>>> ext4_fc_track_template().
>>>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>>>>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>>>>> greater than the one being handled.  The STAGING queue will then be spliced
>>>>> back into MAIN.
>>>>> This bug was found using fstest generic/047.  This test creates several 32k
>>>>> bytes files, sync'ing each of them after it's creation, and then shutting
>>>>> down the filesystem.  Some data may be loss in this operation; for example a
>>>>> file may have it's size truncated to zero.
>>>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>>>>> ---
>>>>> Hi!
>>>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>>>>> is non-zero.  It will be zero when doing an fc commit, and this would mean
>>>>> to always re-enqueue the inode.  This fixes the regressions caught by Ted
>>>>> in v3 with fstests generic/472 generic/496 generic/643.
>>>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>>>>> to be applied on top of it.
>>>>>   fs/ext4/fast_commit.c | 10 ++++++++++
>>>>>   1 file changed, 10 insertions(+)
>>>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>>>> index 3926a05eceee..facbc8dbbaa2 100644
>>>>> --- a/fs/ext4/fast_commit.c
>>>>> +++ b/fs/ext4/fast_commit.c
>>>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>>>>   				       EXT4_STATE_FC_COMMITTING);
>>>>>   		if (tid_geq(tid, iter->i_sync_tid))
>>>>>   			ext4_fc_reset_inode(&iter->vfs_inode);
>>>>> +		} else if (tid) {
>>>>> +			/*
>>>>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
>>>> one quick question about tid, if one disk is using long time and its tid   get
>>>> wrapped to 0, is it a valid seq? I don't find code handling this situation.
>>> Hmm... OK.  So, to answer to your question, the 'tid' is expected to wrap.
>>> That's why we use:
>>> 
>>> 	if (tid_geq(tid, iter->i_sync_tid))
>> Yes, I know this.
>>> 
>>> instead of:
>>> 
>>> 	if (tid >= iter->i_sync_tid)
>>> 
>>> (The second patch in v3 actually fixed a few places where the tid_*()
>>> helpers weren't being used.)
>>> 
>>> But your question shows me that my patch is wrong as '0' may actually be a
>>> valid 'tid' value.
>> 
>> Actually my question is,  there are some place use '0' to check if a transaction is valid, e.g.
>> 
>> In ext4_wait_for_tail_page_commit()
>> 
>> 5218         while (1) {
>> 5219                 struct folio *folio = filemap_lock_folio(inode->i_mapping,
>> 5220                                       inode->i_size >> PAGE_SHIFT);
>> 5221                 if (IS_ERR(folio))
>> 5222                         return;
>> 5223                 ret = __ext4_journalled_invalidate_folio(folio, offset,
>> 5224 folio_size(folio) - offset);
>> 5225                 folio_unlock(folio);
>> 5226                 folio_put(folio);
>> 5227                 if (ret != -EBUSY)
>> 5228                         return;
>> 5229                 commit_tid = 0;
>> 5230                 read_lock(&journal->j_state_lock);
>> 5231                 if (journal->j_committing_transaction)
>> 5232                         commit_tid = journal->j_committing_transaction->t_tid;
>> 5233                 read_unlock(&journal->j_state_lock);
>> 5234                 if (commit_tid)
>> 5235                         jbd2_log_wait_commit(journal, commit_tid);
>> 5236         }
>> 5237  We only wait commit if tid is not zero.
>> 
>> And in __jbd2_log_wait_for_space()
>> 
>> 79                 if (space_left < nblocks) {
>>  80                         int chkpt = journal->j_checkpoint_transactions != NULL;
>>  81                         tid_t tid = 0;
>>  82
>>  83                         if (journal->j_committing_transaction)
>>  84                                 tid = journal->j_committing_transaction->t_tid;
>>  85 spin_unlock(&journal->j_list_lock);
>>  86 write_unlock(&journal->j_state_lock);
>>  87                         if (chkpt) {
>>  88 jbd2_log_do_checkpoint(journal);
>>  89                         } else if (jbd2_cleanup_journal_tail(journal) == 0) {
>>  90                                 /* We were able to recover space; yay! */
>>  91                                 ;
>>  92                         } else if (tid) {
>>  93                                 /*
>>  94                                  * jbd2_journal_commit_transaction() may want
>>  95                                  * to take the checkpoint_mutex if JBD2_FLUSHED
>>  96                                  * is set.  So we need to temporarily drop it.
>>  97                                  */
>>  98 mutex_unlock(&journal->j_checkpoint_mutex);
>>  99                                 jbd2_log_wait_commit(journal, tid);
>> 100 write_lock(&journal->j_state_lock);
>> 101                                 continue;
>> We also only wait commit if tid is not zero.
>> 
>> Does it mean all these have bugs if '0' is a valid 'tid' ?
>> 
>> But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s,
>> 
>> time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase
>> 
>> more quickly in real world ?
>
> The simple solution is that "0" is not a valid sequence.  It looks like
> this is a bug in jbd2_get_transaction() where it is incrementing the TID:
>
>         transaction->t_tid = journal->j_transaction_sequence++;
>
> it should add a check to handle the wrap-around:
>
>         if (unlikely(transaction->t_tid == 0))
>                 transaction->t_tid = journal->j_transaction_sequence++;

Sound good to me.  I can prepare a patch with this change if no one else
sees other issues.  As far as I can see, this shouldn't be a problem even
when replaying journals that still have a '0' tid.

Thanks, Andreas.  And thanks Wang, for spotting this.

Cheers,
-- 
Luís

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH] jbd2: make '0' an invalid transaction sequence
  2024-07-12  9:15         ` Luis Henriques
@ 2024-07-12  9:53           ` Luis Henriques
  2024-07-12 10:04             ` wangjianjian (C)
  2024-07-16  9:52             ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Luis Henriques @ 2024-07-12  9:53 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Wang Jianjian, wangjianjian (C), Theodore Ts'o, Jan Kara,
	Harshad Shirwadkar, linux-ext4, linux-kernel

Since there's code (in fast-commit) that already handles a '0' tid as a
special case, it's better to ensure that jbd2 never sets it to that value
when journal->j_transaction_sequence increment wraps.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
 fs/jbd2/transaction.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 66513c18ca29..4dbdd37349c3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -84,6 +84,8 @@ static void jbd2_get_transaction(journal_t *journal,
 	transaction->t_state = T_RUNNING;
 	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
+	if (unlikely(transaction->t_tid == 0))
+		transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	atomic_set(&transaction->t_updates, 0);
 	atomic_set(&transaction->t_outstanding_credits,

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] jbd2: make '0' an invalid transaction sequence
  2024-07-12  9:53           ` [RFC PATCH] jbd2: make '0' an invalid transaction sequence Luis Henriques
@ 2024-07-12 10:04             ` wangjianjian (C)
  2024-07-12 10:28               ` wangjianjian (C)
  2024-07-16  9:52             ` Jan Kara
  1 sibling, 1 reply; 14+ messages in thread
From: wangjianjian (C) @ 2024-07-12 10:04 UTC (permalink / raw)
  To: Luis Henriques, Andreas Dilger
  Cc: Wang Jianjian, Theodore Ts'o, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel

On 2024/7/12 17:53, Luis Henriques wrote:
> Since there's code (in fast-commit) that already handles a '0' tid as a
> special case, it's better to ensure that jbd2 never sets it to that value
> when journal->j_transaction_sequence increment wraps.
> 
> Suggested-by: Andreas Dilger <adilger@dilger.ca>
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
> ---
>   fs/jbd2/transaction.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 66513c18ca29..4dbdd37349c3 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -84,6 +84,8 @@ static void jbd2_get_transaction(journal_t *journal,
>   	transaction->t_state = T_RUNNING;
>   	transaction->t_start_time = ktime_get();
>   	transaction->t_tid = journal->j_transaction_sequence++;
> +	if (unlikely(transaction->t_tid == 0))
> +		transaction->t_tid = journal->j_transaction_sequence++;
Do we need add j_transaction_sequence again here? if tansanction->t_tid 
== 0, then journal->j_trnasaction_sequence must be 1

>   	transaction->t_expires = jiffies + journal->j_commit_interval;
>   	atomic_set(&transaction->t_updates, 0);
>   	atomic_set(&transaction->t_outstanding_credits,
-- 
Regards


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] jbd2: make '0' an invalid transaction sequence
  2024-07-12 10:04             ` wangjianjian (C)
@ 2024-07-12 10:28               ` wangjianjian (C)
  0 siblings, 0 replies; 14+ messages in thread
From: wangjianjian (C) @ 2024-07-12 10:28 UTC (permalink / raw)
  To: Luis Henriques, Andreas Dilger
  Cc: Wang Jianjian, Theodore Ts'o, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel

On 2024/7/12 18:04, wangjianjian (C) wrote:
> On 2024/7/12 17:53, Luis Henriques wrote:
>> Since there's code (in fast-commit) that already handles a '0' tid as a
>> special case, it's better to ensure that jbd2 never sets it to that value
>> when journal->j_transaction_sequence increment wraps.
>>
>> Suggested-by: Andreas Dilger <adilger@dilger.ca>
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>> ---
>>   fs/jbd2/transaction.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 66513c18ca29..4dbdd37349c3 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -84,6 +84,8 @@ static void jbd2_get_transaction(journal_t *journal,
>>       transaction->t_state = T_RUNNING;
>>       transaction->t_start_time = ktime_get();
>>       transaction->t_tid = journal->j_transaction_sequence++;
>> +    if (unlikely(transaction->t_tid == 0))
>> +        transaction->t_tid = journal->j_transaction_sequence++;
> Do we need add j_transaction_sequence again here? if tansanction->t_tid 
> == 0, then journal->j_trnasaction_sequence must be 1
still need add 1. Sorry for confuse.
> 
>>       transaction->t_expires = jiffies + journal->j_commit_interval;
>>       atomic_set(&transaction->t_updates, 0);
>>       atomic_set(&transaction->t_outstanding_credits,
-- 
Regards


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] jbd2: make '0' an invalid transaction sequence
  2024-07-12  9:53           ` [RFC PATCH] jbd2: make '0' an invalid transaction sequence Luis Henriques
  2024-07-12 10:04             ` wangjianjian (C)
@ 2024-07-16  9:52             ` Jan Kara
  2024-07-16 13:11               ` Luis Henriques
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2024-07-16  9:52 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Andreas Dilger, Wang Jianjian, wangjianjian (C),
	Theodore Ts'o, Jan Kara, Harshad Shirwadkar, linux-ext4,
	linux-kernel

On Fri 12-07-24 10:53:02, Luis Henriques wrote:
> Since there's code (in fast-commit) that already handles a '0' tid as a
> special case, it's better to ensure that jbd2 never sets it to that value
> when journal->j_transaction_sequence increment wraps.
> 
> Suggested-by: Andreas Dilger <adilger@dilger.ca>
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>

Well, sadly it isn't so simple. If nothing else, journal replay
(do_one_pass()) will get broken by the skipped tid as we do check:

                if (sequence != next_commit_ID) {
                        brelse(bh);
                        break;
                }

So we'd abort journal replay too early. Secondly, there's also code
handling journal replay in libext2fs which would need to be checked and
fixed up. Finally, I've found code in mballoc which alternates between two
lists based on tid & 1, so this logic would get broken by skipping 0 tid
as well.

Overall, I think we might be better off to go and fix places that assume
tid 0 is not valid. I can see those assumptions in:

ext4_fc_mark_ineligible()
ext4_wait_for_tail_page_commit()
__jbd2_log_wait_for_space()
jbd2_journal_shrink_checkpoint_list()

Now I don't see it as urgent to fix all these right now. Just for this
series let's not add another place making tid 0 special. Later we can fixup
the other places...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] jbd2: make '0' an invalid transaction sequence
  2024-07-16  9:52             ` Jan Kara
@ 2024-07-16 13:11               ` Luis Henriques
  0 siblings, 0 replies; 14+ messages in thread
From: Luis Henriques @ 2024-07-16 13:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, Wang Jianjian, wangjianjian (C),
	Theodore Ts'o, Harshad Shirwadkar, linux-ext4, linux-kernel

On Tue, Jul 16 2024, Jan Kara wrote:

> On Fri 12-07-24 10:53:02, Luis Henriques wrote:
>> Since there's code (in fast-commit) that already handles a '0' tid as a
>> special case, it's better to ensure that jbd2 never sets it to that value
>> when journal->j_transaction_sequence increment wraps.
>> 
>> Suggested-by: Andreas Dilger <adilger@dilger.ca>
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>
> Well, sadly it isn't so simple. If nothing else, journal replay
> (do_one_pass()) will get broken by the skipped tid as we do check:
>
>                 if (sequence != next_commit_ID) {
>                         brelse(bh);
>                         break;
>                 }
>
> So we'd abort journal replay too early. Secondly, there's also code
> handling journal replay in libext2fs which would need to be checked and
> fixed up. Finally, I've found code in mballoc which alternates between two
> lists based on tid & 1, so this logic would get broken by skipping 0 tid
> as well.
>
> Overall, I think we might be better off to go and fix places that assume
> tid 0 is not valid. I can see those assumptions in:
>
> ext4_fc_mark_ineligible()
> ext4_wait_for_tail_page_commit()
> __jbd2_log_wait_for_space()
> jbd2_journal_shrink_checkpoint_list()
>
> Now I don't see it as urgent to fix all these right now. Just for this
> series let's not add another place making tid 0 special. Later we can fixup
> the other places...

Yikes!  Looks like I haven't done my homework -- I should have caught at
least one of the three breakages you point out.  Obviously, because I've
seen this assumption in different places, I thought it would be OK.

Anyway, thanks a lot for point this out, Jan.  I'll add a new TODO to my
list to start looking at other places that need to be fixed.

Cheers,
-- 
Luís

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-11  8:35 [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Luis Henriques (SUSE)
  2024-07-11 13:32 ` wangjianjian (C)
@ 2024-07-16 10:24 ` Jan Kara
  2024-07-16 14:13   ` Luis Henriques
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2024-07-16 10:24 UTC (permalink / raw)
  To: Luis Henriques (SUSE)
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Harshad Shirwadkar,
	linux-ext4, linux-kernel

On Thu 11-07-24 09:35:20, Luis Henriques (SUSE) wrote:
> When a full journal commit is on-going, any fast commit has to be enqueued
> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
> is done only once, i.e. if an inode is already queued in a previous fast
> commit entry it won't be enqueued again.  However, if a full commit starts
> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
> be done into FC_Q_STAGING.  And this is not being done in function
> ext4_fc_track_template().
> 
> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
> during the fast commit clean-up callback if it has a tid (i_sync_tid)
> greater than the one being handled.  The STAGING queue will then be spliced
> back into MAIN.
> 
> This bug was found using fstest generic/047.  This test creates several 32k
> bytes files, sync'ing each of them after it's creation, and then shutting
> down the filesystem.  Some data may be loss in this operation; for example a
> file may have it's size truncated to zero.
> 
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>

...

> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 3926a05eceee..facbc8dbbaa2 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>  				       EXT4_STATE_FC_COMMITTING);
>  		if (tid_geq(tid, iter->i_sync_tid))
>  			ext4_fc_reset_inode(&iter->vfs_inode);
> +		} else if (tid) {
> +			/*
> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
> +			 * inode into STAGING, which will then be splice back
> +			 * into MAIN
> +			 */
> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
> +				      &sbi->s_fc_q[FC_Q_STAGING]);
> +		}

I don't think this is going to work (even if we fix the tid 0 being special
assumption). With this there would be a race like:

Task 1					Task2
modify inode I
ext4_fc_commit()
  jbd2_fc_begin_commit()
  commits changes
  jbd2_fc_end_commit()
    __jbd2_fc_end_commit(journal, 0, false)
      jbd2_journal_unlock_updates(journal)
					jbd2_journal_start()
					modify inode I
					...
					ext4_mark_iloc_dirty()
					  ext4_fc_track_inode()
					    ext4_fc_track_template()
					      - doesn't add inode anywhere
					      because i_fc_list is not empty
      ext4_fc_cleanup(journal, 0, 0)
        removes inode I from i_fc_list => next fastcommit will not properly
flush it.

To avoid this race I think we could move the
journal->j_fc_cleanup_callback() call to happen before we call
jbd2_journal_unlock_updates(). Then we are sure that inode cannot be
modified (journal is locked) until we are done processing the fastcommit
lists when doing fastcommit. Hence your patch could then be changed like:

+		} else if (full) {
+			/*
+			 * We are called after a full commit, inode has been
+			 * modified while the commit was running. Re-enqueue
+			 * the inode into STAGING, which will then be splice
+			 * back into MAIN. This cannot happen during
+			 * fastcommit because the journal is locked all the
+			 * time in that case (and tid doesn't increase so
+			 * tid check above isn't reliable).
+			 */
+			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
+				      &sbi->s_fc_q[FC_Q_STAGING]);
+		}

Later, Harshad's patches change the code to use EXT4_STATE_FC_COMMITTING
for protecting inodes during fastcommit and that will also deal with these
races without having to keep the whole journal locked.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
  2024-07-16 10:24 ` [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Jan Kara
@ 2024-07-16 14:13   ` Luis Henriques
  0 siblings, 0 replies; 14+ messages in thread
From: Luis Henriques @ 2024-07-16 14:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Luis Henriques (SUSE), Theodore Ts'o, Andreas Dilger,
	Harshad Shirwadkar, linux-ext4, linux-kernel

On Tue, Jul 16 2024, Jan Kara wrote:

> On Thu 11-07-24 09:35:20, Luis Henriques (SUSE) wrote:
>> When a full journal commit is on-going, any fast commit has to be enqueued
>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>> is done only once, i.e. if an inode is already queued in a previous fast
>> commit entry it won't be enqueued again.  However, if a full commit starts
>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>> be done into FC_Q_STAGING.  And this is not being done in function
>> ext4_fc_track_template().
>> 
>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>> greater than the one being handled.  The STAGING queue will then be spliced
>> back into MAIN.
>> 
>> This bug was found using fstest generic/047.  This test creates several 32k
>> bytes files, sync'ing each of them after it's creation, and then shutting
>> down the filesystem.  Some data may be loss in this operation; for example a
>> file may have it's size truncated to zero.
>> 
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>
> ...
>
>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>> index 3926a05eceee..facbc8dbbaa2 100644
>> --- a/fs/ext4/fast_commit.c
>> +++ b/fs/ext4/fast_commit.c
>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>  				       EXT4_STATE_FC_COMMITTING);
>>  		if (tid_geq(tid, iter->i_sync_tid))
>>  			ext4_fc_reset_inode(&iter->vfs_inode);
>> +		} else if (tid) {
>> +			/*
>> +			 * If the tid is valid (i.e. non-zero) re-enqueue the
>> +			 * inode into STAGING, which will then be splice back
>> +			 * into MAIN
>> +			 */
>> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
>> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>> +		}
>
> I don't think this is going to work (even if we fix the tid 0 being special
> assumption). With this there would be a race like:
>
> Task 1					Task2
> modify inode I
> ext4_fc_commit()
>   jbd2_fc_begin_commit()
>   commits changes
>   jbd2_fc_end_commit()
>     __jbd2_fc_end_commit(journal, 0, false)
>       jbd2_journal_unlock_updates(journal)
> 					jbd2_journal_start()
> 					modify inode I
> 					...
> 					ext4_mark_iloc_dirty()
> 					  ext4_fc_track_inode()
> 					    ext4_fc_track_template()
> 					      - doesn't add inode anywhere
> 					      because i_fc_list is not empty
>       ext4_fc_cleanup(journal, 0, 0)
>         removes inode I from i_fc_list => next fastcommit will not properly
> flush it.
>
> To avoid this race I think we could move the
> journal->j_fc_cleanup_callback() call to happen before we call
> jbd2_journal_unlock_updates(). Then we are sure that inode cannot be
> modified (journal is locked) until we are done processing the fastcommit
> lists when doing fastcommit. Hence your patch could then be changed like:
>
> +		} else if (full) {
> +			/*
> +			 * We are called after a full commit, inode has been
> +			 * modified while the commit was running. Re-enqueue
> +			 * the inode into STAGING, which will then be splice
> +			 * back into MAIN. This cannot happen during
> +			 * fastcommit because the journal is locked all the
> +			 * time in that case (and tid doesn't increase so
> +			 * tid check above isn't reliable).
> +			 */
> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
> +				      &sbi->s_fc_q[FC_Q_STAGING]);
> +		}
>
> Later, Harshad's patches change the code to use EXT4_STATE_FC_COMMITTING
> for protecting inodes during fastcommit and that will also deal with these
> races without having to keep the whole journal locked.

OK, this looks like it should fix all the issues I was trying to fix
(g/047, g/472, and a few others Ted pointed out).  I'll go run a few more
tests on this to try to catch any possible regression.

Once again, thanks a lot for your help, Jan.

Cheers,
-- 
Luís

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-07-16 14:13 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-11  8:35 [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Luis Henriques (SUSE)
2024-07-11 13:32 ` wangjianjian (C)
2024-07-11 15:16   ` Luis Henriques
2024-07-11 16:16     ` Wang Jianjian
2024-07-11 19:28       ` Andreas Dilger
2024-07-12  0:51         ` wangjianjian (C)
2024-07-12  9:15         ` Luis Henriques
2024-07-12  9:53           ` [RFC PATCH] jbd2: make '0' an invalid transaction sequence Luis Henriques
2024-07-12 10:04             ` wangjianjian (C)
2024-07-12 10:28               ` wangjianjian (C)
2024-07-16  9:52             ` Jan Kara
2024-07-16 13:11               ` Luis Henriques
2024-07-16 10:24 ` [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Jan Kara
2024-07-16 14:13   ` Luis Henriques

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox