From: Luis Henriques <luis.henriques@linux.dev>
To: Andreas Dilger <adilger@dilger.ca>
Cc: Wang Jianjian <wangjianjian0@foxmail.com>,
"wangjianjian (C)" <wangjianjian3@huawei.com>,
Theodore Ts'o <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
Harshad Shirwadkar <harshadshirwadkar@gmail.com>,
linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit
Date: Fri, 12 Jul 2024 10:15:04 +0100 [thread overview]
Message-ID: <87ed7znf8n.fsf@linux.dev> (raw)
In-Reply-To: <A90C7898-B704-4B2A-BFE6-4A32050763F0@dilger.ca> (Andreas Dilger's message of "Thu, 11 Jul 2024 13:28:23 -0600")
On Thu, Jul 11 2024, Andreas Dilger wrote:
> On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@foxmail.com> wrote:
>> On 2024/7/11 23:16, Luis Henriques wrote:
>>> On Thu, Jul 11 2024, wangjianjian (C) wrote:
>>>
>>>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote:
>>>>> When a full journal commit is on-going, any fast commit has to be enqueued
>>>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
>>>>> is done only once, i.e. if an inode is already queued in a previous fast
>>>>> commit entry it won't be enqueued again. However, if a full commit starts
>>>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>>>>> be done into FC_Q_STAGING. And this is not being done in function
>>>>> ext4_fc_track_template().
>>>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue
>>>>> during the fast commit clean-up callback if it has a tid (i_sync_tid)
>>>>> greater than the one being handled. The STAGING queue will then be spliced
>>>>> back into MAIN.
>>>>> This bug was found using fstest generic/047. This test creates several 32k
>>>>> bytes files, sync'ing each of them after it's creation, and then shutting
>>>>> down the filesystem. Some data may be loss in this operation; for example a
>>>>> file may have it's size truncated to zero.
>>>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>>>>> ---
>>>>> Hi!
>>>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid
>>>>> is non-zero. It will be zero when doing an fc commit, and this would mean
>>>>> to always re-enqueue the inode. This fixes the regressions caught by Ted
>>>>> in v3 with fstests generic/472 generic/496 generic/643.
>>>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch
>>>>> to be applied on top of it.
>>>>> fs/ext4/fast_commit.c | 10 ++++++++++
>>>>> 1 file changed, 10 insertions(+)
>>>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>>>> index 3926a05eceee..facbc8dbbaa2 100644
>>>>> --- a/fs/ext4/fast_commit.c
>>>>> +++ b/fs/ext4/fast_commit.c
>>>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>>>> EXT4_STATE_FC_COMMITTING);
>>>>> if (tid_geq(tid, iter->i_sync_tid))
>>>>> ext4_fc_reset_inode(&iter->vfs_inode);
>>>>> + } else if (tid) {
>>>>> + /*
>>>>> + * If the tid is valid (i.e. non-zero) re-enqueue the
>>>> one quick question about tid, if one disk is using long time and its tid get
>>>> wrapped to 0, is it a valid seq? I don't find code handling this situation.
>>> Hmm... OK. So, to answer to your question, the 'tid' is expected to wrap.
>>> That's why we use:
>>>
>>> if (tid_geq(tid, iter->i_sync_tid))
>> Yes, I know this.
>>>
>>> instead of:
>>>
>>> if (tid >= iter->i_sync_tid)
>>>
>>> (The second patch in v3 actually fixed a few places where the tid_*()
>>> helpers weren't being used.)
>>>
>>> But your question shows me that my patch is wrong as '0' may actually be a
>>> valid 'tid' value.
>>
>> Actually my question is, there are some place use '0' to check if a transaction is valid, e.g.
>>
>> In ext4_wait_for_tail_page_commit()
>>
>> 5218 while (1) {
>> 5219 struct folio *folio = filemap_lock_folio(inode->i_mapping,
>> 5220 inode->i_size >> PAGE_SHIFT);
>> 5221 if (IS_ERR(folio))
>> 5222 return;
>> 5223 ret = __ext4_journalled_invalidate_folio(folio, offset,
>> 5224 folio_size(folio) - offset);
>> 5225 folio_unlock(folio);
>> 5226 folio_put(folio);
>> 5227 if (ret != -EBUSY)
>> 5228 return;
>> 5229 commit_tid = 0;
>> 5230 read_lock(&journal->j_state_lock);
>> 5231 if (journal->j_committing_transaction)
>> 5232 commit_tid = journal->j_committing_transaction->t_tid;
>> 5233 read_unlock(&journal->j_state_lock);
>> 5234 if (commit_tid)
>> 5235 jbd2_log_wait_commit(journal, commit_tid);
>> 5236 }
>> 5237 We only wait commit if tid is not zero.
>>
>> And in __jbd2_log_wait_for_space()
>>
>> 79 if (space_left < nblocks) {
>> 80 int chkpt = journal->j_checkpoint_transactions != NULL;
>> 81 tid_t tid = 0;
>> 82
>> 83 if (journal->j_committing_transaction)
>> 84 tid = journal->j_committing_transaction->t_tid;
>> 85 spin_unlock(&journal->j_list_lock);
>> 86 write_unlock(&journal->j_state_lock);
>> 87 if (chkpt) {
>> 88 jbd2_log_do_checkpoint(journal);
>> 89 } else if (jbd2_cleanup_journal_tail(journal) == 0) {
>> 90 /* We were able to recover space; yay! */
>> 91 ;
>> 92 } else if (tid) {
>> 93 /*
>> 94 * jbd2_journal_commit_transaction() may want
>> 95 * to take the checkpoint_mutex if JBD2_FLUSHED
>> 96 * is set. So we need to temporarily drop it.
>> 97 */
>> 98 mutex_unlock(&journal->j_checkpoint_mutex);
>> 99 jbd2_log_wait_commit(journal, tid);
>> 100 write_lock(&journal->j_state_lock);
>> 101 continue;
>> We also only wait commit if tid is not zero.
>>
>> Does it mean all these have bugs if '0' is a valid 'tid' ?
>>
>> But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s,
>>
>> time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase
>>
>> more quickly in real world ?
>
> The simple solution is that "0" is not a valid sequence. It looks like
> this is a bug in jbd2_get_transaction() where it is incrementing the TID:
>
> transaction->t_tid = journal->j_transaction_sequence++;
>
> it should add a check to handle the wrap-around:
>
> if (unlikely(transaction->t_tid == 0))
> transaction->t_tid = journal->j_transaction_sequence++;
Sound good to me. I can prepare a patch with this change if no one else
sees other issues. As far as I can see, this shouldn't be a problem even
when replaying journals that still have a '0' tid.
Thanks, Andreas. And thanks Wang, for spotting this.
Cheers,
--
Luís
next prev parent reply other threads:[~2024-07-12 9:15 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-11 8:35 [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Luis Henriques (SUSE)
2024-07-11 13:32 ` wangjianjian (C)
2024-07-11 15:16 ` Luis Henriques
2024-07-11 16:16 ` Wang Jianjian
2024-07-11 19:28 ` Andreas Dilger
2024-07-12 0:51 ` wangjianjian (C)
2024-07-12 9:15 ` Luis Henriques [this message]
2024-07-12 9:53 ` [RFC PATCH] jbd2: make '0' an invalid transaction sequence Luis Henriques
2024-07-12 10:04 ` wangjianjian (C)
2024-07-12 10:28 ` wangjianjian (C)
2024-07-16 9:52 ` Jan Kara
2024-07-16 13:11 ` Luis Henriques
2024-07-16 10:24 ` [PATCH v4] ext4: fix fast commit inode enqueueing during a full journal commit Jan Kara
2024-07-16 14:13 ` Luis Henriques
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ed7znf8n.fsf@linux.dev \
--to=luis.henriques@linux.dev \
--cc=adilger@dilger.ca \
--cc=harshadshirwadkar@gmail.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=wangjianjian0@foxmail.com \
--cc=wangjianjian3@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox