* [PATCH] ext4: inline: do not convert when writing to memory map
@ 2025-05-19 10:42 Thadeu Lima de Souza Cascardo
2025-05-20 14:57 ` Theodore Ts'o
0 siblings, 1 reply; 5+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2025-05-19 10:42 UTC (permalink / raw)
To: Theodore Ts'o, Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers
Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
syzbot+0c89d865531d053abb2d, Thadeu Lima de Souza Cascardo,
stable
inline data handling has a race between writing and writing to a memory
map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
destroys the inline data, but if block allocation fails, restores the
inline data. In that process, we could have:
CPU1 CPU2
destroy_inline_data
write_begin (does not see inline data)
restory_inline_data
write_end (sees inline data)
This leads to bugs like the one below, as write_begin did not prepare for
the case of inline data, which is expected by the write_end side of it.
------------[ cut here ]------------
kernel BUG at fs/ext4/inline.c:235!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline]
RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774
Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900031c7320 EFLAGS: 00010293
RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000
RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b
RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070
R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a
R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b
FS: 00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0
Call Trace:
<TASK>
generic_perform_write+0x6f8/0x990 mm/filemap.c:4070
ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299
ext4_file_write_iter+0x892/0x1c50
iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743
do_splice_from fs/splice.c:941 [inline]
direct_splice_actor+0x11d/0x220 fs/splice.c:1164
splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108
do_splice_direct_actor fs/splice.c:1207 [inline]
do_splice_direct+0x289/0x3e0 fs/splice.c:1233
do_sendfile+0x564/0x8a0 fs/read_write.c:1363
__do_sys_sendfile64 fs/read_write.c:1424 [inline]
__se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5c6bb18d09
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09
RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004
RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000
R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620
R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
This happens because ext4_page_mkwrite is not protected by the inode_lock.
The xattr semaphore is not sufficient to protect inline data handling in a
sane way, so we need to rely on the inode_lock. Adding the inode_lock to
ext4_page_mkwrite is not an option, otherwise lock-ordering problems with
mmap_lock may arise.
The conversion inside ext4_page_mkwrite was introduced at commit
7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
fixes a documented bug in the commit message, which suggests some
alternatives fixes.
The alternative of keeping the data inline is left as an exercise for the
reader.
Instead, block allocation still happens, just as it does right now. But
removal of the inline data is only done when pages are written back.
Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Cc: stable@vger.kernel.org
---
fs/ext4/inode.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94c7d2d828a64e42ded09c82497ed7617071aa19..38653d5c8b32ede2b130285ab13ef1e96b1ba783 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2620,8 +2620,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
ret = PTR_ERR(handle);
goto out_writepages;
}
- BUG_ON(ext4_test_inode_state(inode,
- EXT4_STATE_MAY_INLINE_DATA));
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
ext4_destroy_inline_data(handle, inode);
ext4_journal_stop(handle);
}
@@ -6222,10 +6221,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
filemap_invalidate_lock_shared(mapping);
- err = ext4_convert_inline_data(inode);
- if (err)
- goto out_ret;
-
/*
* On data journalling we skip straight to the transaction handle:
* there's no delalloc; page truncated will be checked later; the
---
base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
change-id: 20250519-ext4_inline_page_mkwrite-c42ca1f02295
Best regards,
--
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] ext4: inline: do not convert when writing to memory map
2025-05-19 10:42 [PATCH] ext4: inline: do not convert when writing to memory map Thadeu Lima de Souza Cascardo
@ 2025-05-20 14:57 ` Theodore Ts'o
2025-05-21 21:52 ` Thadeu Lima de Souza Cascardo
0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2025-05-20 14:57 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers, linux-ext4,
linux-fsdevel, linux-kernel, kernel-dev,
syzbot+0c89d865531d053abb2d, stable
On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
> inline data handling has a race between writing and writing to a memory
> map.
>
> When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> destroys the inline data, but if block allocation fails, restores the
> inline data. In that process, we could have:
>
> CPU1 CPU2
> destroy_inline_data
> write_begin (does not see inline data)
> restory_inline_data
> write_end (sees inline data)
>
> The conversion inside ext4_page_mkwrite was introduced at commit
> 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> fixes a documented bug in the commit message, which suggests some
> alternatives fixes.
Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON.
While this is great for shutting up the syzbot report, but it causes
file writes to an inline data file via a mmap to never get written
back to the storage device. So you are replacing BUG_ON that can get
triggered on a race condition in case of a failed block allocation,
with silent data corruption. This is not an improvement.
Thanks for trying to address this, but I'm not going to accept your
proposed fix.
- Ted
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ext4: inline: do not convert when writing to memory map
2025-05-20 14:57 ` Theodore Ts'o
@ 2025-05-21 21:52 ` Thadeu Lima de Souza Cascardo
2025-05-26 13:43 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2025-05-21 21:52 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers, linux-ext4,
linux-fsdevel, linux-kernel, kernel-dev,
syzbot+0c89d865531d053abb2d, stable
On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
> On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
> > inline data handling has a race between writing and writing to a memory
> > map.
> >
> > When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> > destroys the inline data, but if block allocation fails, restores the
> > inline data. In that process, we could have:
> >
> > CPU1 CPU2
> > destroy_inline_data
> > write_begin (does not see inline data)
> > restory_inline_data
> > write_end (sees inline data)
> >
> > The conversion inside ext4_page_mkwrite was introduced at commit
> > 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> > fixes a documented bug in the commit message, which suggests some
> > alternatives fixes.
>
> Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON.
> While this is great for shutting up the syzbot report, but it causes
> file writes to an inline data file via a mmap to never get written
> back to the storage device. So you are replacing BUG_ON that can get
> triggered on a race condition in case of a failed block allocation,
> with silent data corruption. This is not an improvement.
>
> Thanks for trying to address this, but I'm not going to accept your
> proposed fix.
>
> - Ted
Hi, Ted.
I am trying to understand better the circumstances where the data loss
might occur with the fix, but might not occur without the fix. Or, even if
they occur either way, such that I can work on a better/proper fix.
Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite)
fails with ENOSPC, the memory access will lead to a SIGBUS. The same will
happen without the fix, if there are no blocks available.
Now, without ext4_convert_inline_data, blocks will be allocated by
ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned
about a failure between the clearing of the inode data and the writing of
the block in ext4_do_writepages?
Or are you concerned about a potential race condition when allocating
blocks?
Which of these cannot happen today with the code as is? If I understand
correctly, the inline conversion code also calls ext4_destroy_inline_data
before allocating and writing to blocks.
Thanks a lot for the review and guidance.
Cascardo.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ext4: inline: do not convert when writing to memory map
2025-05-21 21:52 ` Thadeu Lima de Souza Cascardo
@ 2025-05-26 13:43 ` Jan Kara
2025-05-26 14:10 ` Thadeu Lima de Souza Cascardo
0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2025-05-26 13:43 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Theodore Ts'o, Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers,
linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
syzbot+0c89d865531d053abb2d, stable
On Wed 21-05-25 18:52:03, Thadeu Lima de Souza Cascardo wrote:
> On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
> > On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
> > > inline data handling has a race between writing and writing to a memory
> > > map.
> > >
> > > When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> > > destroys the inline data, but if block allocation fails, restores the
> > > inline data. In that process, we could have:
> > >
> > > CPU1 CPU2
> > > destroy_inline_data
> > > write_begin (does not see inline data)
> > > restory_inline_data
> > > write_end (sees inline data)
> > >
> > > The conversion inside ext4_page_mkwrite was introduced at commit
> > > 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> > > fixes a documented bug in the commit message, which suggests some
> > > alternatives fixes.
> >
> > Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON.
> > While this is great for shutting up the syzbot report, but it causes
> > file writes to an inline data file via a mmap to never get written
> > back to the storage device. So you are replacing BUG_ON that can get
> > triggered on a race condition in case of a failed block allocation,
> > with silent data corruption. This is not an improvement.
> >
> > Thanks for trying to address this, but I'm not going to accept your
> > proposed fix.
> >
> > - Ted
>
> Hi, Ted.
>
> I am trying to understand better the circumstances where the data loss
> might occur with the fix, but might not occur without the fix. Or, even if
> they occur either way, such that I can work on a better/proper fix.
>
> Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite)
> fails with ENOSPC, the memory access will lead to a SIGBUS. The same will
> happen without the fix, if there are no blocks available.
>
> Now, without ext4_convert_inline_data, blocks will be allocated by
> ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned
> about a failure between the clearing of the inode data and the writing of
> the block in ext4_do_writepages?
>
> Or are you concerned about a potential race condition when allocating
> blocks?
>
> Which of these cannot happen today with the code as is? If I understand
> correctly, the inline conversion code also calls ext4_destroy_inline_data
> before allocating and writing to blocks.
>
> Thanks a lot for the review and guidance.
So I'm not sure what Ted was exactly worried about because writeback code
should normally allocate underlying blocks for writeout of the mmaped page
AFAICT. But the problem I can see is that clearing
EXT4_STATE_MAY_INLINE_DATA requires i_rwsem held as otherwise we may be
racing with e.g. write(2) and switching EXT4_STATE_MAY_INLINE_DATA in the
middle of the write will cause bad things (inconsistency between how
write_begin() and write_end() callbacks behave).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] ext4: inline: do not convert when writing to memory map
2025-05-26 13:43 ` Jan Kara
@ 2025-05-26 14:10 ` Thadeu Lima de Souza Cascardo
0 siblings, 0 replies; 5+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2025-05-26 14:10 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Ts'o, Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers,
linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
syzbot+0c89d865531d053abb2d, stable
On Mon, May 26, 2025 at 03:43:31PM +0200, Jan Kara wrote:
> On Wed 21-05-25 18:52:03, Thadeu Lima de Souza Cascardo wrote:
> > On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
> > > On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
> > > > inline data handling has a race between writing and writing to a memory
> > > > map.
> > > >
> > > > When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> > > > destroys the inline data, but if block allocation fails, restores the
> > > > inline data. In that process, we could have:
> > > >
> > > > CPU1 CPU2
> > > > destroy_inline_data
> > > > write_begin (does not see inline data)
> > > > restory_inline_data
> > > > write_end (sees inline data)
> > > >
> > > > The conversion inside ext4_page_mkwrite was introduced at commit
> > > > 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> > > > fixes a documented bug in the commit message, which suggests some
> > > > alternatives fixes.
> > >
> > > Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON.
> > > While this is great for shutting up the syzbot report, but it causes
> > > file writes to an inline data file via a mmap to never get written
> > > back to the storage device. So you are replacing BUG_ON that can get
> > > triggered on a race condition in case of a failed block allocation,
> > > with silent data corruption. This is not an improvement.
> > >
> > > Thanks for trying to address this, but I'm not going to accept your
> > > proposed fix.
> > >
> > > - Ted
> >
> > Hi, Ted.
> >
> > I am trying to understand better the circumstances where the data loss
> > might occur with the fix, but might not occur without the fix. Or, even if
> > they occur either way, such that I can work on a better/proper fix.
> >
> > Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite)
> > fails with ENOSPC, the memory access will lead to a SIGBUS. The same will
> > happen without the fix, if there are no blocks available.
> >
> > Now, without ext4_convert_inline_data, blocks will be allocated by
> > ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned
> > about a failure between the clearing of the inode data and the writing of
> > the block in ext4_do_writepages?
> >
> > Or are you concerned about a potential race condition when allocating
> > blocks?
> >
> > Which of these cannot happen today with the code as is? If I understand
> > correctly, the inline conversion code also calls ext4_destroy_inline_data
> > before allocating and writing to blocks.
> >
> > Thanks a lot for the review and guidance.
>
> So I'm not sure what Ted was exactly worried about because writeback code
> should normally allocate underlying blocks for writeout of the mmaped page
> AFAICT. But the problem I can see is that clearing
> EXT4_STATE_MAY_INLINE_DATA requires i_rwsem held as otherwise we may be
> racing with e.g. write(2) and switching EXT4_STATE_MAY_INLINE_DATA in the
> middle of the write will cause bad things (inconsistency between how
> write_begin() and write_end() callbacks behave).
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Thanks, Jan.
I later noticed as well that writepages is not holding the inode lock
either, so there would be a potential for race condition there as well.
I have sent a v2 that I find would not have this problem. But we should
probably cleanup the handling of inline data in writepages as a followup.
Cascardo.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-05-26 14:10 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-19 10:42 [PATCH] ext4: inline: do not convert when writing to memory map Thadeu Lima de Souza Cascardo
2025-05-20 14:57 ` Theodore Ts'o
2025-05-21 21:52 ` Thadeu Lima de Souza Cascardo
2025-05-26 13:43 ` Jan Kara
2025-05-26 14:10 ` Thadeu Lima de Souza Cascardo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).