[PATCH -next] ext4: add an update to i_disksize in ext4_block_page

linux-ext4.vger.kernel.org archive mirror

* [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
@ 2025-07-31 14:05 sunyongjian
  2025-09-01  7:01 ` Sun Yongjian
  0 siblings, 1 reply; 9+ messages in thread
From: sunyongjian @ 2025-07-31 14:05 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, yangerkun, yi.zhang, libaokun1, chengzhihao1,
	sunyongjian1

From: Yongjian Sun <sunyongjian1@huawei.com>

After running a stress test combined with fault injection,
we performed fsck -a followed by fsck -fn on the filesystem
image. During the second pass, fsck -fn reported:

Inode 131512, end of extent exceeds allowed value
	(logical block 405, physical block 1180540, len 2)

This inode was not in the orphan list. Analysis revealed the
following call chain that leads to the inconsistency:

                             ext4_da_write_end()
                              //does not update i_disksize
                             ext4_punch_hole()
                              //truncate folio, keep size
ext4_page_mkwrite()
 ext4_block_page_mkwrite()
  ext4_block_write_begin()
    ext4_get_block()
     //insert written extent without update i_disksize
journal commit
echo 1 > /sys/block/xxx/device/delete

The delalloc write path updates i_size but does not update i_disksize.
ext4_punch_hole() then truncates the delalloc folio and still leaves
i_disksize unchanged. Next, ext4_page_mkwrite() sees ext4_nonda_switch()
return 1 and takes the nodioread_nolock path. The folio about to be
written has just been punched out, and its offset sits beyond the current
i_disksize. A written extent may therefore be inserted, again without
updating i_disksize. If the journal gets committed and the block device
is then removed, we can end up with the inconsistency reported above.

To fix this, we now check in ext4_block_page_mkwrite whether
i_disksize needs to be updated to cover the newly allocated blocks.

Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ed54c4d0f2f9..050270b265ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
 		goto out_error;

 	if (!ext4_should_journal_data(inode)) {
+		loff_t disksize = folio_pos(folio) + len;
 		block_commit_write(folio, 0, len);
 		folio_mark_dirty(folio);
+		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
+			down_write(&EXT4_I(inode)->i_data_sem);
+			if (disksize > EXT4_I(inode)->i_disksize)
+				EXT4_I(inode)->i_disksize = disksize;
+			up_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_mark_inode_dirty(handle, inode);
+			if (ret)
+				goto out_error;
+		}
 	} else {
 		ret = ext4_journal_folio_buffers(handle, folio, len);
 		if (ret)
-- 
2.39.2
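
The sequence in the commit message can be followed with a small userspace model. This is only a sketch under simplifying assumptions: `struct model_inode`, the `model_*` helpers, and the single "no written extent may end past i_disksize" rule are hypothetical stand-ins, not kernel or e2fsck code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the two sizes ext4 tracks per inode (not kernel code). */
struct model_inode {
	int64_t i_size;      /* in-memory size, grown by the delalloc write */
	int64_t i_disksize;  /* size that would reach disk at journal commit */
	int64_t written_end; /* end offset of the last written extent, 0 if none */
};

/* Delalloc write: grows i_size only; i_disksize waits for writeback. */
static void model_da_write(struct model_inode *ino, int64_t pos, int64_t len)
{
	if (pos + len > ino->i_size)
		ino->i_size = pos + len;
}

/* Punch hole that does not reach EOF: both sizes stay as they are. */
static void model_punch_hole_mid_file(struct model_inode *ino,
				      int64_t off, int64_t len)
{
	assert(off + len < ino->i_size);
}

/*
 * page_mkwrite without delalloc: a written extent is allocated right away.
 * fix_disksize selects the behaviour proposed by the patch.
 */
static void model_page_mkwrite(struct model_inode *ino, int64_t pos,
			       int64_t len, bool fix_disksize)
{
	int64_t end = pos + len;

	if (end > ino->written_end)
		ino->written_end = end;
	if (fix_disksize && end > ino->i_disksize)
		ino->i_disksize = end;
}

/* Simplified consistency rule: no written extent may end past i_disksize. */
static bool model_fsck_clean(const struct model_inode *ino)
{
	return ino->written_end <= ino->i_disksize;
}

/* Runs the sequence from the call chain; true means the model is clean. */
static bool model_run(bool fix_disksize)
{
	struct model_inode ino = { 0, 0, 0 };

	model_da_write(&ino, 0, 8 * 4096);
	model_punch_hole_mid_file(&ino, 4096, 2 * 4096);
	model_page_mkwrite(&ino, 4096, 4096, fix_disksize);
	return model_fsck_clean(&ino); /* state at "journal commit" */
}
```

In this model, `model_run(false)` leaves a written extent ending at 8192 while i_disksize is still 0, and `model_run(true)` raises i_disksize to cover it.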


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-07-31 14:05 [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite sunyongjian
@ 2025-09-01  7:01 ` Sun Yongjian
  2025-09-03 15:03   ` Jan Kara
  2025-09-04  9:11   ` Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Sun Yongjian @ 2025-09-01  7:01 UTC (permalink / raw)
  To: linux-ext4; +Cc: yangerkun, yi.zhang, libaokun1, tytso, jack



On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
Gentle ping.



* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-01  7:01 ` Sun Yongjian
@ 2025-09-03 15:03   ` Jan Kara
  2025-09-04  4:03     ` Ritesh Harjani
  2025-09-04  9:11   ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2025-09-03 15:03 UTC (permalink / raw)
  To: Sun Yongjian
  Cc: linux-ext4, yangerkun, yi.zhang, libaokun1, tytso, jack,
	Ritesh Harjani

On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
> Gentle ping.
> > From: Yongjian Sun <sunyongjian1@huawei.com>
> > 
> > After running a stress test combined with fault injection,
> > we performed fsck -a followed by fsck -fn on the filesystem
> > image. During the second pass, fsck -fn reported:
> > 
> > Inode 131512, end of extent exceeds allowed value
> > 	(logical block 405, physical block 1180540, len 2)
> > 
> > This inode was not in the orphan list.

Thanks for the report! Interesting... Which kernel were you using?

> > Analysis revealed the
> > following call chain that leads to the inconsistency:
> > 
> >                               ext4_da_write_end()
> >                                //does not update i_disksize

Right, for any write beyond i_disksize to unallocated blocks we update
i_disksize only during page writeback.

> >                               ext4_punch_hole()
> >                                //truncate folio, keep size

So here the offset + len passed to ext4_punch_hole() is important, because
there's an ext4_update_disksize_before_punch() call which updates i_disksize
to i_size if the punched hole reaches EOF. So did you punch a hole in the
middle of the file?
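
For readers following the call chain, the user-visible operations it implies can be sketched as below. This is an assumption-laden illustration, not the reporter's test: the thread does not confirm the exact offsets, the temporary path and sizes are made up, and the sketch does not reproduce the corruption by itself (that additionally needs the conditions named in the report, such as ext4_nonda_switch() returning 1, nodioread_nolock, and device removal after a journal commit).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Issues a buffered write, a mid-file punch hole with KEEP_SIZE, and a
 * store through a shared mapping into the punched range.
 * Returns 0 on success, 1 if hole punching is not supported here,
 * -1 on any other error. Path template and sizes are illustrative.
 */
static int punch_then_mmap_write(void)
{
	char path[] = "/tmp/punch-demo-XXXXXX";
	const size_t blk = 4096, nblk = 8;
	char buf[4096];
	char *map;
	int rc = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path); /* the file lives until fd is closed */

	memset(buf, 'a', sizeof(buf));
	for (size_t i = 0; i < nblk; i++) /* buffered write: grows i_size */
		if (pwrite(fd, buf, blk, (off_t)(i * blk)) != (ssize_t)blk)
			goto out;

	/* Hole in the middle of the file; the file size is kept. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      (off_t)blk, (off_t)(2 * blk)) != 0) {
		rc = (errno == EOPNOTSUPP || errno == ENOSYS) ? 1 : -1;
		goto out;
	}

	/* Shared writable mapping: the store below causes a write fault,
	 * which on ext4 is handled by ext4_page_mkwrite(). */
	map = mmap(NULL, nblk * blk, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	map[blk] = 'b';
	rc = (msync(map, nblk * blk, MS_SYNC) == 0) ? 0 : -1;
	munmap(map, nblk * blk);
out:
	close(fd);
	return rc;
}
```

Because the hole covers blocks 1 and 2 of an eight-block file, it does not reach EOF, which is the case the question above is asking about.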

> > ext4_page_mkwrite()
> >   ext4_block_page_mkwrite()
> >    ext4_block_write_begin()
> >      ext4_get_block()
> >       //insert written extent without update i_disksize

We should insert an unwritten extent here, shouldn't we? We use
ext4_get_block_unwritten() when we are inside i_size. Ah, you mention below
that you use nodioread_nolock. Nasty :)

> > journal commit
> > echo 1 > /sys/block/xxx/device/delete
> > 
> > da-write path updates i_size but does not update i_disksize. Then
> > ext4_punch_hole truncates the da-folio yet still leaves i_disksize
> > unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
> > and takes the nodioread_nolock path, the folio about to be written
> > has just been punched out, and its offset sits beyond the current
> > i_disksize. This may result in a written extent being inserted, but
> > again does not update i_disksize. If the journal gets committed and
> > then the block device is yanked, we might run into this.
> > 
> > To fix this, we now check in ext4_block_page_mkwrite whether
> > i_disksize needs to be updated to cover the newly allocated blocks.
> > 
> > Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>

Hum, rather than complicating this niche code what if we just
unconditionally used ext4_get_block_unwritten() in
ext4_block_page_mkwrite() when delalloc gets disabled? It is far from any
performance critical path. What do people think? The code would actually
have to be something like:

	if (ext4_should_journal_data(inode))
		get_block = ext4_get_block;
	else
		get_block = ext4_get_block_unwritten;

to properly handle data journalling. I'm adding Ritesh to CC because I do
remember there used to be some issues with dioread_nolock with blocksize <
pagesize which he was able to trigger. But I think they were fixed.

								Honza

> > ---
> >   fs/ext4/inode.c | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index ed54c4d0f2f9..050270b265ae 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
> >   		goto out_error;
> >   	if (!ext4_should_journal_data(inode)) {
> > +		loff_t disksize = folio_pos(folio) + len;
> >   		block_commit_write(folio, 0, len);
> >   		folio_mark_dirty(folio);
> > +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
> > +			down_write(&EXT4_I(inode)->i_data_sem);
> > +			if (disksize > EXT4_I(inode)->i_disksize)
> > +				EXT4_I(inode)->i_disksize = disksize;
> > +			up_write(&EXT4_I(inode)->i_data_sem);
> > +			ret = ext4_mark_inode_dirty(handle, inode);
> > +			if (ret)
> > +				goto out_error;
> > +		}
> >   	} else {
> >   		ret = ext4_journal_folio_buffers(handle, folio, len);
> >   		if (ret)
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-03 15:03   ` Jan Kara
@ 2025-09-04  4:03     ` Ritesh Harjani
  2025-09-04  9:06       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Ritesh Harjani @ 2025-09-04  4:03 UTC (permalink / raw)
  To: Jan Kara, Sun Yongjian
  Cc: linux-ext4, yangerkun, yi.zhang, libaokun1, tytso, jack

Jan Kara <jack@suse.cz> writes:

> On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
>> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
>> Gentle ping.
>> > From: Yongjian Sun <sunyongjian1@huawei.com>
>> > 
>> > After running a stress test combined with fault injection,
>> > we performed fsck -a followed by fsck -fn on the filesystem
>> > image. During the second pass, fsck -fn reported:
>> > 
>> > Inode 131512, end of extent exceeds allowed value
>> > 	(logical block 405, physical block 1180540, len 2)
>> > 
>> > This inode was not in the orphan list.
>
> Thanks for report! Interesting... Which kernel were you using?
>
>> > Analysis revealed the
>> > following call chain that leads to the inconsistency:
>> > 
>> >                               ext4_da_write_end()
>> >                                //does not update i_disksize
>
> Right, for any write beyond i_disksize to unallocated blocks we update
> i_disksize only during page writeback.
>
>> >                               ext4_punch_hole()
>> >                                //truncate folio, keep size
>
> So here offset + len passed to ext4_punch_hole() is important. Because
> there's ext4_update_disksize_before_punch() call which updates i_disksize
> to i_size if the punched hole reaches EOF. So did you punch hole in the
> middle of the file?
>
>> > ext4_page_mkwrite()
>> >   ext4_block_page_mkwrite()
>> >    ext4_block_write_begin()
>> >      ext4_get_block()
>> >       //insert written extent without update i_disksize
>
> We should insert unwritten extent here, shouldn't we? We use
> ext4_get_block_unwritten() when we are inside i_size. Ah, you mention below
> you use nodioread_nolock. Nasty :)
>
>> > journal commit
>> > echo 1 > /sys/block/xxx/device/delete
>> > 
>> > da-write path updates i_size but does not update i_disksize. Then
>> > ext4_punch_hole truncates the da-folio yet still leaves i_disksize
>> > unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
>> > and takes the nodioread_nolock path, the folio about to be written
>> > has just been punched out, and its offset sits beyond the current
>> > i_disksize. This may result in a written extent being inserted, but
>> > again does not update i_disksize. If the journal gets committed and
>> > then the block device is yanked, we might run into this.
>> > 
>> > To fix this, we now check in ext4_block_page_mkwrite whether
>> > i_disksize needs to be updated to cover the newly allocated blocks.
>> > 
>> > Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
>
> Hum, rather than complicating this niche code what if we just
> unconditionally used ext4_get_block_unwritten() in
> ext4_block_page_mkwrite() when delalloc gets disabled? It is far from any
> performance critical path. What do people think? The code would actually
> have to be something like:
>
> 	if (ext4_should_journal_data(inode))
> 		get_block = ext4_get_block;
> 	else
> 		get_block = ext4_get_block_unwritten;
>

The problem was mainly that e2fsck finds a written extent beyond
i_disksize. If it is an unwritten extent, we are still fine. So
using ext4_get_block_unwritten() by default in the page fault path makes
sense.

So what about the other checks in ext4_should_dioread_nolock(), like
S_ISREG() and files with indirect blocks? We still need ext4_get_block()
for them, right? (Do we even fault on !S_ISREG() :) ?)


> to properly handle data journalling. I'm adding Ritesh to CC because I do
> remember there used to be some issues with dioread_nolock with blocksize <
> pagesize which he was able to trigger. But I think they were fixed.
>

Right, they were fixed and dioread_nolock is now the default mount
option for ext4 for bs <= ps. Here are some of the fixes which were
made. [2] was the more recent one from Ojaswin.

[1]: https://lore.kernel.org/all/af902b5db99e8b73980c795d84ad7bb417487e76.1602168865.git.riteshh@linux.ibm.com/
[2]: https://lore.kernel.org/all/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com/
[3]: This patch enabled dioread_nolock for bs < ps:
https://lore.kernel.org/all/20231101154717.531865-1-ojaswin@linux.ibm.com/

> 								Honza
>
>> > ---
>> >   fs/ext4/inode.c | 10 ++++++++++
>> >   1 file changed, 10 insertions(+)
>> > 
>> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> > index ed54c4d0f2f9..050270b265ae 100644
>> > --- a/fs/ext4/inode.c
>> > +++ b/fs/ext4/inode.c
>> > @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>> >   		goto out_error;
>> >   	if (!ext4_should_journal_data(inode)) {
>> > +		loff_t disksize = folio_pos(folio) + len;
>> >   		block_commit_write(folio, 0, len);
>> >   		folio_mark_dirty(folio);
>> > +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>> > +			down_write(&EXT4_I(inode)->i_data_sem);
>> > +			if (disksize > EXT4_I(inode)->i_disksize)
>> > +				EXT4_I(inode)->i_disksize = disksize;
>> > +			up_write(&EXT4_I(inode)->i_data_sem);
>> > +			ret = ext4_mark_inode_dirty(handle, inode);
>> > +			if (ret)
>> > +				goto out_error;
>> > +		}
>> >   	} else {
>> >   		ret = ext4_journal_folio_buffers(handle, folio, len);
>> >   		if (ret)
>> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-04  4:03     ` Ritesh Harjani
@ 2025-09-04  9:06       ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2025-09-04  9:06 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, Sun Yongjian, linux-ext4, yangerkun, yi.zhang,
	libaokun1, tytso

On Thu 04-09-25 09:33:56, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> > On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
> >> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
> >> Gentle ping.
> >> > From: Yongjian Sun <sunyongjian1@huawei.com>
> >> > 
> >> > After running a stress test combined with fault injection,
> >> > we performed fsck -a followed by fsck -fn on the filesystem
> >> > image. During the second pass, fsck -fn reported:
> >> > 
> >> > Inode 131512, end of extent exceeds allowed value
> >> > 	(logical block 405, physical block 1180540, len 2)
> >> > 
> >> > This inode was not in the orphan list.
> >
> > Thanks for report! Interesting... Which kernel were you using?
> >
> >> > Analysis revealed the
> >> > following call chain that leads to the inconsistency:
> >> > 
> >> >                               ext4_da_write_end()
> >> >                                //does not update i_disksize
> >
> > Right, for any write beyond i_disksize to unallocated blocks we update
> > i_disksize only during page writeback.
> >
> >> >                               ext4_punch_hole()
> >> >                                //truncate folio, keep size
> >
> > So here offset + len passed to ext4_punch_hole() is important. Because
> > there's ext4_update_disksize_before_punch() call which updates i_disksize
> > to i_size if the punched hole reaches EOF. So did you punch hole in the
> > middle of the file?
> >
> >> > ext4_page_mkwrite()
> >> >   ext4_block_page_mkwrite()
> >> >    ext4_block_write_begin()
> >> >      ext4_get_block()
> >> >       //insert written extent without update i_disksize
> >
> > We should insert unwritten extent here, shouldn't we? We use
> > ext4_get_block_unwritten() when we are inside i_size. Ah, you mention below
> > you use nodioread_nolock. Nasty :)
> >
> >> > journal commit
> >> > echo 1 > /sys/block/xxx/device/delete
> >> > 
> >> > da-write path updates i_size but does not update i_disksize. Then
> >> > ext4_punch_hole truncates the da-folio yet still leaves i_disksize
> >> > unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
> >> > and takes the nodioread_nolock path, the folio about to be written
> >> > has just been punched out, and its offset sits beyond the current
> >> > i_disksize. This may result in a written extent being inserted, but
> >> > again does not update i_disksize. If the journal gets committed and
> >> > then the block device is yanked, we might run into this.
> >> > 
> >> > To fix this, we now check in ext4_block_page_mkwrite whether
> >> > i_disksize needs to be updated to cover the newly allocated blocks.
> >> > 
> >> > Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
> >
> > Hum, rather than complicating this niche code what if we just
> > unconditionally used ext4_get_block_unwritten() in
> > ext4_block_page_mkwrite() when delalloc gets disabled? It is far from any
> > performance critical path. What do people think? The code would actually
> > have to be something like:
> >
> > 	if (ext4_should_journal_data(inode))
> > 		get_block = ext4_get_block;
> > 	else
> > 		get_block = ext4_get_block_unwritten;
> >
> 
> The problem mainly was when e2fsck identify a written extent beyond
> i_disksize. But if it is unwritten extent, then we are still good. So
> using ext4_get_block_unwritten() by default in page fault path make
> sense.
> 
> So what about other checks like S_ISREG() and file with indirect blocks
> in ext4_should_dioread_nolock()? We still need ext4_get_block() for them right? 
> (Do we even fault on !S_ISREG() :) ?)

S_ISREG() is always true for page faults, but I forgot about indirect-block
based inodes; there we indeed cannot create an unwritten extent. Drat, so
what Sun suggests is probably the easiest solution in the end. Going back
to his patch.
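
To make the limitation concrete, here is a hypothetical userspace model of how the callback choice would look once both points from the discussion are taken into account. Nobody in the thread proposed this exact function; the enum and `model_pick_get_block()` are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>

/* Which block-mapping callback the fault path would use (model only). */
enum model_get_block {
	MODEL_GET_BLOCK,           /* stands in for ext4_get_block */
	MODEL_GET_BLOCK_UNWRITTEN, /* stands in for ext4_get_block_unwritten */
};

/*
 * Hypothetical selection: data journalling keeps the plain callback, and
 * so do inodes mapped by indirect blocks, because they cannot hold
 * unwritten extents.
 */
static enum model_get_block
model_pick_get_block(bool journal_data, bool extent_mapped)
{
	if (journal_data || !extent_mapped)
		return MODEL_GET_BLOCK;
	return MODEL_GET_BLOCK_UNWRITTEN;
}
```

In this model an indirect-block inode still ends up on the written-allocation path, so switching the callback alone would not close the i_disksize gap for it.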

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-01  7:01 ` Sun Yongjian
  2025-09-03 15:03   ` Jan Kara
@ 2025-09-04  9:11   ` Jan Kara
  2025-09-05  3:25     ` Sun Yongjian
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2025-09-04  9:11 UTC (permalink / raw)
  To: Sun Yongjian; +Cc: linux-ext4, yangerkun, yi.zhang, libaokun1, tytso, jack

On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
> Gentle ping.
> > From: Yongjian Sun <sunyongjian1@huawei.com>
> > 
> > After running a stress test combined with fault injection,
> > we performed fsck -a followed by fsck -fn on the filesystem
> > image. During the second pass, fsck -fn reported:
> > 
> > Inode 131512, end of extent exceeds allowed value
> > 	(logical block 405, physical block 1180540, len 2)
> > 
> > This inode was not in the orphan list. Analysis revealed the
> > following call chain that leads to the inconsistency:
> > 
> >                               ext4_da_write_end()
> >                                //does not update i_disksize
> >                               ext4_punch_hole()
> >                                //truncate folio, keep size
> > ext4_page_mkwrite()
> >   ext4_block_page_mkwrite()
> >    ext4_block_write_begin()
> >      ext4_get_block()
> >       //insert written extent without update i_disksize
> > journal commit
> > echo 1 > /sys/block/xxx/device/delete
> > 
> > da-write path updates i_size but does not update i_disksize. Then
> > ext4_punch_hole truncates the da-folio yet still leaves i_disksize
> > unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
> > and takes the nodioread_nolock path, the folio about to be written
> > has just been punched out, and its offset sits beyond the current
> > i_disksize. This may result in a written extent being inserted, but
> > again does not update i_disksize. If the journal gets committed and
> > then the block device is yanked, we might run into this.
> > 
> > To fix this, we now check in ext4_block_page_mkwrite whether
> > i_disksize needs to be updated to cover the newly allocated blocks.
> > 
> > Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>

OK, after the discussion with Ritesh your solution looks like the best one.
Just two nits below:

> > ---
> >   fs/ext4/inode.c | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index ed54c4d0f2f9..050270b265ae 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
> >   		goto out_error;
> >   	if (!ext4_should_journal_data(inode)) {
> > +		loff_t disksize = folio_pos(folio) + len;

Use an empty line between declarations and the code please.

> >   		block_commit_write(folio, 0, len);
> >   		folio_mark_dirty(folio);
> > +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
> > +			down_write(&EXT4_I(inode)->i_data_sem);
> > +			if (disksize > EXT4_I(inode)->i_disksize)
> > +				EXT4_I(inode)->i_disksize = disksize;
> > +			up_write(&EXT4_I(inode)->i_data_sem);
> > +			ret = ext4_mark_inode_dirty(handle, inode);
> > +			if (ret)
> > +				goto out_error;
> > +		}

Since we don't support delalloc with data journalling, your code is correct
but I think it would be more understandable if you just moved the
i_disksize update outside of the "if (!ext4_should_journal_data(inode))"
condition.
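
Wherever the update ends up, it follows a check, lock, re-check pattern. A minimal userspace sketch of that pattern, assuming a plain pthread mutex in place of the i_data_sem rw-semaphore (only its write side is used here) and a C11 atomic in place of READ_ONCE(); `struct model_ei` and `model_raise_disksize()` are hypothetical names:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-in for the two fields the patch touches. */
struct model_ei {
	pthread_mutex_t data_sem; /* plays the role of i_data_sem */
	_Atomic int64_t disksize; /* plays the role of i_disksize */
};

/*
 * Raise disksize to new_end if it is smaller. Returns 1 when the value
 * changed, 0 otherwise. Note: the patch calls ext4_mark_inode_dirty()
 * whenever the unlocked check passes; this model reports only real
 * changes.
 */
static int model_raise_disksize(struct model_ei *ei, int64_t new_end)
{
	int changed = 0;

	/* Cheap unlocked check, like the READ_ONCE() test in the patch. */
	if (new_end <= atomic_load_explicit(&ei->disksize,
					    memory_order_relaxed))
		return 0;

	pthread_mutex_lock(&ei->data_sem);
	/* Re-check under the lock: another thread may have raced us. */
	if (new_end > atomic_load_explicit(&ei->disksize,
					   memory_order_relaxed)) {
		atomic_store_explicit(&ei->disksize, new_end,
				      memory_order_relaxed);
		changed = 1;
	}
	pthread_mutex_unlock(&ei->data_sem);
	return changed;
}
```

The unlocked test keeps the common case (i_disksize already large enough) free of lock traffic, while the re-check ensures the value never moves backwards when two updaters race.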

> >   	} else {
> >   		ret = ext4_journal_folio_buffers(handle, folio, len);
> >   		if (ret)
> 

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-04  9:11   ` Jan Kara
@ 2025-09-05  3:25     ` Sun Yongjian
  2025-09-05 12:58       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Sun Yongjian @ 2025-09-05  3:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, yangerkun, yi.zhang, libaokun1, tytso



On 2025/9/4 17:11, Jan Kara wrote:
> On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
>> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
>> Gentle ping.
>>> From: Yongjian Sun <sunyongjian1@huawei.com>
>>>
>>> After running a stress test combined with fault injection,
>>> we performed fsck -a followed by fsck -fn on the filesystem
>>> image. During the second pass, fsck -fn reported:
>>>
>>> Inode 131512, end of extent exceeds allowed value
>>> 	(logical block 405, physical block 1180540, len 2)
>>>
>>> This inode was not in the orphan list. Analysis revealed the
>>> following call chain that leads to the inconsistency:
>>>
>>>                                ext4_da_write_end()
>>>                                 //does not update i_disksize
>>>                                ext4_punch_hole()
>>>                                 //truncate folio, keep size
>>> ext4_page_mkwrite()
>>>    ext4_block_page_mkwrite()
>>>     ext4_block_write_begin()
>>>       ext4_get_block()
>>>        //insert written extent without update i_disksize
>>> journal commit
>>> echo 1 > /sys/block/xxx/device/delete
>>>
>>> da-write path updates i_size but does not update i_disksize. Then
>>> ext4_punch_hole truncates the da-folio yet still leaves i_disksize
>>> unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
>>> and takes the nodioread_nolock path, the folio about to be written
>>> has just been punched out, and its offset sits beyond the current
>>> i_disksize. This may result in a written extent being inserted, but
>>> again does not update i_disksize. If the journal gets committed and
>>> then the block device is yanked, we might run into this.
>>>
>>> To fix this, we now check in ext4_block_page_mkwrite whether
>>> i_disksize needs to be updated to cover the newly allocated blocks.
>>>
>>> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
> 
> OK, after the discussion with Ritesh your solution looks like the best one.
> Just two nits below:
> 
>>> ---
>>>    fs/ext4/inode.c | 10 ++++++++++
>>>    1 file changed, 10 insertions(+)
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index ed54c4d0f2f9..050270b265ae 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>>>    		goto out_error;
>>>    	if (!ext4_should_journal_data(inode)) {
>>> +		loff_t disksize = folio_pos(folio) + len;
> 
> Use an empty line between declarations and the code please.
> 
>>>    		block_commit_write(folio, 0, len);
>>>    		folio_mark_dirty(folio);
>>> +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>>> +			down_write(&EXT4_I(inode)->i_data_sem);
>>> +			if (disksize > EXT4_I(inode)->i_disksize)
>>> +				EXT4_I(inode)->i_disksize = disksize;
>>> +			up_write(&EXT4_I(inode)->i_data_sem);
>>> +			ret = ext4_mark_inode_dirty(handle, inode);
>>> +			if (ret)
>>> +				goto out_error;
>>> +		}
> 
> Since we don't support delalloc with data journalling, your code is correct
> but I think it would be more understandable if you just moved the
> i_disksize update outside of the "if (!ext4_should_journal_data(inode))"
> condition.
> 
>>>    	} else {
>>>    		ret = ext4_journal_folio_buffers(handle, folio, len);
>>>    		if (ret)
>>
> 
> 								Honza
Thanks for the review, I will send a patch to improve this!


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-05  3:25     ` Sun Yongjian
@ 2025-09-05 12:58       ` Jan Kara
  2025-09-06 12:27         ` Sun Yongjian
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2025-09-05 12:58 UTC (permalink / raw)
  To: Sun Yongjian; +Cc: Jan Kara, linux-ext4, yangerkun, yi.zhang, libaokun1, tytso

On Fri 05-09-25 11:25:49, Sun Yongjian wrote:
> On 2025/9/4 17:11, Jan Kara wrote:
> > On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
> > > On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
> > > Gentle ping.
> > > > From: Yongjian Sun <sunyongjian1@huawei.com>
> > > > 
> > > > After running a stress test combined with fault injection,
> > > > we performed fsck -a followed by fsck -fn on the filesystem
> > > > image. During the second pass, fsck -fn reported:
> > > > 
> > > > Inode 131512, end of extent exceeds allowed value
> > > > 	(logical block 405, physical block 1180540, len 2)
> > > > 
> > > > This inode was not in the orphan list. Analysis revealed the
> > > > following call chain that leads to the inconsistency:
> > > > 
> > > >                                ext4_da_write_end()
> > > >                                 //does not update i_disksize
> > > >                                ext4_punch_hole()
> > > >                                 //truncate folio, keep size
> > > > ext4_page_mkwrite()
> > > >    ext4_block_page_mkwrite()
> > > >     ext4_block_write_begin()
> > > >       ext4_get_block()
> > > >        //insert written extent without update i_disksize
> > > > journal commit
> > > > echo 1 > /sys/block/xxx/device/delete
> > > > 
> > > > da-write path updates i_size but does not update i_disksize. Then
> > > > ext4_punch_hole truncates the da-folio yet still leaves i_disksize
> > > > unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
> > > > and takes the nodioread_nolock path, the folio about to be written
> > > > has just been punched out, and its offset sits beyond the current
> > > > i_disksize. This may result in a written extent being inserted, but
> > > > again does not update i_disksize. If the journal gets committed and
> > > > then the block device is yanked, we might run into this.
> > > > 
> > > > To fix this, we now check in ext4_block_page_mkwrite whether
> > > > i_disksize needs to be updated to cover the newly allocated blocks.
> > > > 
> > > > Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
> > 
> > OK, after the discussion with Ritesh your solution looks like the best one.
> > Just two nits below:
> > 
> > > > ---
> > > >    fs/ext4/inode.c | 10 ++++++++++
> > > >    1 file changed, 10 insertions(+)
> > > > 
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index ed54c4d0f2f9..050270b265ae 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
> > > >    		goto out_error;
> > > >    	if (!ext4_should_journal_data(inode)) {
> > > > +		loff_t disksize = folio_pos(folio) + len;
> > 
> > Use an empty line between declarations and the code please.
> > 
> > > >    		block_commit_write(folio, 0, len);
> > > >    		folio_mark_dirty(folio);
> > > > +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
> > > > +			down_write(&EXT4_I(inode)->i_data_sem);
> > > > +			if (disksize > EXT4_I(inode)->i_disksize)
> > > > +				EXT4_I(inode)->i_disksize = disksize;
> > > > +			up_write(&EXT4_I(inode)->i_data_sem);
> > > > +			ret = ext4_mark_inode_dirty(handle, inode);
> > > > +			if (ret)
> > > > +				goto out_error;
> > > > +		}
> > 
> > Since we don't support delalloc with data journalling, your code is correct
> > but I think it would be more understandable if you just moved the
> > i_disksize update outside of the "if (!ext4_should_journal_data(inode))"
> > condition.
> > 
> > > >    	} else {
> > > >    		ret = ext4_journal_folio_buffers(handle, folio, len);
> > > >    		if (ret)
> > > 
> > 
> > 								Honza
> Thanks for the review, I will send a patch to improve this!

Yesterday on ext4 developers call we were further discussing this and Ted
came up with a different way of addressing this issue which might be even
better. Instead of updating i_disksize in ext4_page_mkwrite() we can
instead update i_disksize already during the hole punch. I.e., we can modify
ext4_update_disksize_before_punch() to always increase i_disksize to offset
+ len. That should deal with the problem as well and we would avoid
updating i_disksize from page_mkwrite() which is a bit awkward special case.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
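
For readers following the thread, the sequence from the commit message can be modelled in userspace. This is only a minimal sketch, not kernel code: `struct model_inode`, `model_da_write()`, `model_mkwrite_alloc()` and `model_consistent()` are hypothetical names invented for illustration, sizes are tracked in bytes rather than blocks, and the locking (`i_data_sem`) and journalling done by the real patch are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sizes ext4 tracks per inode (not a kernel type). */
struct model_inode {
	long long i_size;      /* in-memory file size */
	long long i_disksize;  /* size that would reach the on-disk inode */
	long long extent_end;  /* byte offset just past the last written extent */
};

/* Delayed-allocation write: i_size grows, i_disksize is left alone. */
static void model_da_write(struct model_inode *mi, long long pos, long long len)
{
	assert(pos >= 0 && len >= 0);
	if (pos + len > mi->i_size)
		mi->i_size = pos + len;
}

/*
 * Block allocation from the page_mkwrite path: a written extent is inserted.
 * With fix == true, i_disksize is also raised to cover the new blocks,
 * which is the idea of the patch under review.
 */
static void model_mkwrite_alloc(struct model_inode *mi, long long pos,
				long long len, bool fix)
{
	assert(pos >= 0 && len >= 0);
	if (pos + len > mi->extent_end)
		mi->extent_end = pos + len;
	if (fix && pos + len > mi->i_disksize)
		mi->i_disksize = pos + len;
}

/* Simplified version of the invariant fsck checks: no extent past the on-disk size. */
static bool model_consistent(const struct model_inode *mi)
{
	return mi->extent_end <= mi->i_disksize;
}
```

Without the fix, a delayed-allocation write followed by an allocation from the mkwrite path leaves `extent_end` above `i_disksize`, which is the state a crash would expose. With the fix, the two stay in step.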


* Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite
  2025-09-05 12:58       ` Jan Kara
@ 2025-09-06 12:27         ` Sun Yongjian
  0 siblings, 0 replies; 9+ messages in thread
From: Sun Yongjian @ 2025-09-06 12:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, yangerkun, yi.zhang, libaokun1, tytso



On 2025/9/5 20:58, Jan Kara wrote:
> On Fri 05-09-25 11:25:49, Sun Yongjian wrote:
>> On 2025/9/4 17:11, Jan Kara wrote:
>>> On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
>>>> On 2025/7/31 22:05, sunyongjian@huaweicloud.com wrote:
>>>> Gentle ping.
>>>>> From: Yongjian Sun <sunyongjian1@huawei.com>
>>>>>
>>>>> After running a stress test combined with fault injection,
>>>>> we performed fsck -a followed by fsck -fn on the filesystem
>>>>> image. During the second pass, fsck -fn reported:
>>>>>
>>>>> Inode 131512, end of extent exceeds allowed value
>>>>> 	(logical block 405, physical block 1180540, len 2)
>>>>>
>>>>> This inode was not in the orphan list. Analysis revealed the
>>>>> following call chain that leads to the inconsistency:
>>>>>
>>>>>                                 ext4_da_write_end()
>>>>>                                  //does not update i_disksize
>>>>>                                 ext4_punch_hole()
>>>>>                                  //truncate folio, keep size
>>>>> ext4_page_mkwrite()
>>>>>     ext4_block_page_mkwrite()
>>>>>      ext4_block_write_begin()
>>>>>        ext4_get_block()
>>>>>         //insert written extent without update i_disksize
>>>>> journal commit
>>>>> echo 1 > /sys/block/xxx/device/delete
>>>>>
>>>>> da-write path updates i_size but does not update i_disksize. Then
>>>>> ext4_punch_hole truncates the da-folio yet still leaves i_disksize
>>>>> unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
>>>>> and takes the nodioread_nolock path, the folio about to be written
>>>>> has just been punched out, and its offset sits beyond the current
>>>>> i_disksize. This may result in a written extent being inserted, but
>>>>> again does not update i_disksize. If the journal gets committed and
>>>>> then the block device is yanked, we might run into this.
>>>>>
>>>>> To fix this, we now check in ext4_block_page_mkwrite whether
>>>>> i_disksize needs to be updated to cover the newly allocated blocks.
>>>>>
>>>>> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
>>>
>>> OK, after the discussion with Ritesh your solution looks like the best one.
>>> Just two nits below:
>>>
>>>>> ---
>>>>>     fs/ext4/inode.c | 10 ++++++++++
>>>>>     1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index ed54c4d0f2f9..050270b265ae 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>>>>>     		goto out_error;
>>>>>     	if (!ext4_should_journal_data(inode)) {
>>>>> +		loff_t disksize = folio_pos(folio) + len;
>>>
>>> Use an empty line between declarations and the code please.
>>>
>>>>>     		block_commit_write(folio, 0, len);
>>>>>     		folio_mark_dirty(folio);
>>>>> +		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>>>>> +			down_write(&EXT4_I(inode)->i_data_sem);
>>>>> +			if (disksize > EXT4_I(inode)->i_disksize)
>>>>> +				EXT4_I(inode)->i_disksize = disksize;
>>>>> +			up_write(&EXT4_I(inode)->i_data_sem);
>>>>> +			ret = ext4_mark_inode_dirty(handle, inode);
>>>>> +			if (ret)
>>>>> +				goto out_error;
>>>>> +		}
>>>
>>> Since we don't support delalloc with data journalling, your code is correct
>>> but I think it would be more understandable if you just moved the
>>> i_disksize update outside of the "if (!ext4_should_journal_data(inode))"
>>> condition.
>>>
>>>>>     	} else {
>>>>>     		ret = ext4_journal_folio_buffers(handle, folio, len);
>>>>>     		if (ret)
>>>>
>>>
>>> 								Honza
>> Thanks for the review, I will send a patch to improve this!
> 
> Yesterday on ext4 developers call we were further discussing this and Ted
> came up with a different way of addressing this issue which might be even
> better. Instead of updating i_disksize in ext4_page_mkwrite() we can
> instead update i_disksize already during the hole punch. I.e., we can modify
> ext4_update_disksize_before_punch() to always increase i_disksize to offset
> + len. That should deal with the problem as well and we would avoid
> updating i_disksize from page_mkwrite() which is a bit awkward special case.
> 
> 								Honza
> 
I believe this is a more elegant approach to the problem. Let's try it!
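
To make the alternative concrete, here is a rough userspace sketch of the idea of raising i_disksize at hole-punch time. It is not the kernel implementation: `struct punch_model` and `model_update_disksize_before_punch()` are hypothetical names, and clamping the new value to i_size is my assumption (the thread only says "increase i_disksize to offset + len"). Locking and journalling are omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for the two sizes involved (not a kernel type). */
struct punch_model {
	long long i_size;      /* in-memory file size */
	long long i_disksize;  /* size that would reach the on-disk inode */
};

/*
 * Before punching [offset, offset + len), raise i_disksize so that it
 * covers the end of the punched range. The value is clamped to i_size
 * (an assumption of this sketch) and i_disksize is never decreased.
 */
static void model_update_disksize_before_punch(struct punch_model *pm,
					       long long offset, long long len)
{
	long long end;

	assert(offset >= 0 && len >= 0);

	end = offset + len;
	if (end > pm->i_size)
		end = pm->i_size;
	if (end > pm->i_disksize)
		pm->i_disksize = end;
}
```

With i_disksize already covering the punched range, a later block allocation from page_mkwrite() inside that range would no longer land beyond i_disksize, so page_mkwrite() itself would not need to touch i_disksize.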


end of thread, other threads:[~2025-09-06 12:27 UTC | newest]

Thread overview: 9+ messages
2025-07-31 14:05 [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite sunyongjian
2025-09-01  7:01 ` Sun Yongjian
2025-09-03 15:03   ` Jan Kara
2025-09-04  4:03     ` Ritesh Harjani
2025-09-04  9:06       ` Jan Kara
2025-09-04  9:11   ` Jan Kara
2025-09-05  3:25     ` Sun Yongjian
2025-09-05 12:58       ` Jan Kara
2025-09-06 12:27         ` Sun Yongjian
