All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Zhang Yi <yi.zhang@huaweicloud.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>, Baokun Li <libaokun1@huawei.com>,
	linux-kernel@vger.kernel.org,
	Mahesh Kumar <maheshkumar657g@gmail.com>,
	linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
Subject: Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
Date: Sat, 08 Mar 2025 15:31:52 +0530	[thread overview]
Message-ID: <87jz8zhmcv.fsf@gmail.com> (raw)
In-Reply-To: <4f61a0fb-dd9f-4c0e-b872-31e5474ac799@huaweicloud.com>

Zhang Yi <yi.zhang@huaweicloud.com> writes:

> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>>> destroyed eg:
>>>>>>>>>
>>>>>>>>> (umount)
>>>>>>>>> ext4_kill_sb
>>>>>>>>>   kill_block_super
>>>>>>>>>     generic_shutdown_super
>>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>>       evict_inodes
>>>>>>>>>         /* might start a new txn */
>>>>>>>>>       ext4_put_super
>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>         jbd2_journal_destroy
>>>>>>>>>           journal_kill_thread
>>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>>               jbd2_journal_bmap
>>>>>>>>>                 ext4_journal_bmap
>>>>>>>>>                   ext4_map_blocks
>>>>>>>>>                     ...
>>>>>>>>>                     ext4_inode_error
>>>>>>>>>                       ext4_handle_error
>>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>
>>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>>                                                update_super_work
>>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>>                                                    start_this_handle
>>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>>
>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>>
>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>> 2d01ddc86606).
>>>>>>>>>
>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>>
>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
>>>>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>>>>>>>> ---
>>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>  	 */
>>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>>  
>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>> +
>>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>  {
>>>>>>>>>  	int err = 0;
>>>>>>>>>  
>>>>>>>>> +	/*
>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>> +	 * than going via journal.
>>>>>>>>> +	 */
>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Hi, Ojaswin!
>>>>>>>>
>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>> occur.
>>>>>>>>
>>>>>>>>  ext4_put_super()
>>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>                     **kjournald2**
>>>>>>>>                     jbd2_journal_commit_transaction()
>>>>>>>>                     ...
>>>>>>>>                     ext4_inode_error()
>>>>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>>>>                       schedule_work(s_sb_upd_work)
>>>>>>>>
>>>>>>>>                                   **workqueue**
>>>>>>>>                                    update_super_work
>>>>>>>>                                    /* s_journal_destorying is not set */
>>>>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>>>>
>>>>>>>>   ext4_journal_destroy()
>>>>>>>>    /* set s_journal_destorying */
>>>>>>>>    sbi->s_journal_destorying = true;
>>>>>>>>    jbd2_journal_destroy()
>>>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>
>>>>>>>>                                        jbd2_journal_start()
>>>>>>>>                                         start_this_handle()
>>>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yi.
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>>
>>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>>> helper. Just wondering if I should still let it be in
>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>>
>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>
>>>>>>> Regards,
>>>>>>> ojaswin
>>>>>>
>>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>> it?
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>  
>>>>>>                      **kjournald2**
>>>>>>                      jbd2_journal_commit_transaction()
>>>>>>                      ...
>>>>>>                      ext4_inode_error()
>>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>>                        schedule_work(s_sb_upd_work)
>>>>>>  
>>>>>>                                     **workqueue**
>>>>>>                                     update_super_work
>>>>>>                                     /* s_journal_destorying is not set */
>>>>>>                              	      if (journal && !s_journal_destorying)
>>>>>>  
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>                                       schedule_work()
>>>>>                                         ^^^^^^^^^^^^^^^
>>>>>                                         where does this come from?
>>>>>
>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>> not commit transaction or call schedule_work() again because it has
>>>>> been aborted due to the previous error. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>>
>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
>>>>
>>>>  if(journal && !is_journal_destroying) 
>>>>
>>>> is computed but schedule_work() not called yet, which is possible cause
>>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>>> is only called after we have done the flush then we end up with this:
>>>>
>>>>                               	      if (journal && !s_journal_destorying)
>>>>     ext4_journal_destroy()
>>>>      /* set s_journal_destorying */
>>>>      sbi->s_journal_destorying = true;
>>>>      flush_work(&sbi->s_sb_upd_work)
>>>>                                        schedule_work()
>>>>
>>>> Which is possible IMO, although the window is tiny.
>>>
>>> Yeah, right!
>>> Sorry for misread the location where you add the "!s_journal_destorying"
>>> check, the graph I provided was in update_super_work(), which was wrong.
>> 
>> Oh right, I also misread your trace but yes as discussed, even 
>> 
>>     sbi->s_journal_destorying = true;
>> 		flush_work()
>>     jbd2_journal_destroy()
>> 
>> doesn't work.
>> 
>>> The right one should be:
>>>
>>>  ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                     **kjournald2**
>>>                     jbd2_journal_commit_transaction()
>>>                     ...
>>>                     ext4_inode_error()
>>>                       /* s_journal_destorying is not set */
>>>                       if (journal && !s_journal_destorying)
>>>                         (schedule_work(s_sb_upd_work))  //can be here
>>>
>>>   ext4_journal_destroy()
>>>    /* set s_journal_destorying */
>>>    sbi->s_journal_destorying = true;
>>>    jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                         (schedule_work(s_sb_upd_work))  //also can be here
>>>
>>>                                   **workqueue**
>>>                                    update_super_work()
>>>                                    journal = sbi->s_journal //get journal
>>>     kfree(journal)
>>>                                      jbd2_journal_start(journal) //journal UAF
>>>                                        start_this_handle()
>>>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
>>>
>>>
>>> So there are two problems here, the first one is the 'journal' UAF,
>>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>> 
>> Indeed, there's a possible UAF here as well.
>> 
>>>
>>>>>>
>>>>>> As for the fix, how about we do something like this:
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>
>>>>>>    /* trigger a commit and wait for it to complete */
>>>>>>
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>     jbd2_journal_destroy()
>>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>>  
>>>>>>                                         jbd2_journal_start()
>>>>>>                                          start_this_handle()
>>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>>
>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>> be enough to fix the race. Thoughts on this?
>>>>>>
>>>
>>> I think this solution should work, the forced commit and flush_work()
>>> should ensure that the last transaction is committed and that the
>>> potential work is done.
>>>
>>> Besides, the s_journal_destorying flag is set and check concurrently
>>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>>> about adding a new flag into sbi->s_mount_state instead of adding
>>> new s_journal_destorying?
>> 
>> Right, that makes sence. I will incorporate these changes in the next 
>> revision.
>> 
>
> Think about this again, it seems that we no longer need the destroying
> flag. Because we force to commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work to complete. Regardless of whether it starts a new handle in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
>

I think the confusion maybe coming because v2 patch isn't where we
discussed to put the s_journal_destroying to true, in this thread [1]

[1]: https://lore.kernel.org/linux-ext4/jnxpphuradrsf73cxfmohfu7wwwckihtulw6ovsitddgt5pqkg@2uoejkr66qnl/


>  ext4_put_super()

   + sbi->s_journal_destroying = true;

We should add s_journal_destroying to true before calling for
flush_work. 

>   flush_work(&sbi->s_sb_upd_work)

After the above flush work is complete, we will always check if
s_journal_destroying is set. If yes, then we should never schedule the
sb update work

>   destroy_workqueue(sbi->rsv_conversion_wq)
>
>   ext4_journal_destroy()
>    /* trigger a commit (it will commit the last trnasaction) */
>
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                      ext4_inode_error()
>                       schedule_work(s_sb_upd_work))

Then this schedule work will never happen, since it will check if
s_journal_destroying flag is set. 

>
>                                      **workqueue**
>                                       update_super_work()
>                                         jbd2_journal_start(journal)
>                                           start_this_handle()
>                                           //This new trans will
>                                           //not be committed.
>
>                      jbd2_journal_abort()
>
>    /* wait for it to complete */
>
>    flush_work(&sbi->s_sb_upd_work)

No need to again call the flush work here, since there is no new work
which will be scheduled right.

Am I missing something?

-ritesh


>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
>    jbd2_journal_commit_transaction() //it will commit nothing
>
> Thanks,
> Yi.

  parent reply	other threads:[~2025-03-08 10:10 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-06 14:28 [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Ojaswin Mujoo
2025-03-06 14:28 ` [PATCH v2 1/3] ext4: define ext4_journal_destroy wrapper Ojaswin Mujoo
2025-03-07 14:05   ` Jan Kara
2025-03-08  1:18   ` Baokun Li
2025-03-06 14:28 ` [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying Ojaswin Mujoo
2025-03-07  2:49   ` Zhang Yi
2025-03-07  6:34     ` Ojaswin Mujoo
2025-03-07  8:13       ` Ojaswin Mujoo
2025-03-07  8:43         ` Zhang Yi
2025-03-07 10:27           ` Ojaswin Mujoo
2025-03-07 12:36             ` Zhang Yi
2025-03-07 17:26               ` Ojaswin Mujoo
2025-03-08  2:57                 ` Zhang Yi
2025-03-08  8:18                   ` Ojaswin Mujoo
2025-03-08  9:58                     ` Ojaswin Mujoo
2025-03-08 10:10                       ` Baokun Li
2025-03-08 12:57                         ` Ojaswin Mujoo
2025-03-08 10:01                   ` Ritesh Harjani [this message]
2025-03-07 14:26   ` Jan Kara
2025-03-07 17:00     ` Ojaswin Mujoo
2025-03-08  9:55   ` Ritesh Harjani (IBM)
2025-03-08 13:05     ` Ojaswin Mujoo
2025-03-08 13:26       ` Ritesh Harjani
2025-03-08 14:58         ` Ojaswin Mujoo
2025-03-08 18:41           ` Ritesh Harjani
2025-03-09 12:07             ` Ojaswin Mujoo
2025-03-10  4:43               ` Ritesh Harjani
2025-03-12 10:51                 ` Jan Kara
2025-03-12 14:26                   ` Ojaswin Mujoo
2025-03-12 17:15                     ` Jan Kara
2025-03-13  1:20                       ` Zhang Yi
2025-03-13  2:08                         ` Baokun Li
2025-03-12 13:57   ` Eric Sandeen
2025-03-12 14:27     ` Ojaswin Mujoo
2025-03-06 14:28 ` [PATCH v2 3/3] ext4: Make sb update interval tunable Ojaswin Mujoo
2025-03-08  2:25   ` Baokun Li
2025-03-08  1:09 ` [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle Baokun Li
2025-03-08 13:07   ` Ojaswin Mujoo
2025-03-08 15:21     ` Ojaswin Mujoo
2025-03-10  3:10       ` Baokun Li
2025-03-10  7:59         ` Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jz8zhmcv.fsf@gmail.com \
    --to=ritesh.list@gmail.com \
    --cc=jack@suse.cz \
    --cc=libaokun1@huawei.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maheshkumar657g@gmail.com \
    --cc=ojaswin@linux.ibm.com \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huaweicloud.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.