linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] ext4: inline: convert when mmap is called, not when page is written
@ 2025-05-26 14:13 Thadeu Lima de Souza Cascardo
  2025-05-26 16:20 ` Jan Kara
  0 siblings, 1 reply; 3+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2025-05-26 14:13 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
	syzbot+0c89d865531d053abb2d, Thadeu Lima de Souza Cascardo,
	stable

inline data handling has a race between writing and writing to a memory
map.

When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
destroys the inline data, but if block allocation fails, restores the
inline data. In that process, we could have:

CPU1					CPU2
destroy_inline_data
					write_begin (does not see inline data)
restory_inline_data
					write_end (sees inline data)

This leads to bugs like the one below, as write_begin did not prepare for
the case of inline data, which is expected by the write_end side of it.

------------[ cut here ]------------
kernel BUG at fs/ext4/inline.c:235!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline]
RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774
Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900031c7320 EFLAGS: 00010293
RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000
RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b
RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070
R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a
R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b
FS:  00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 generic_perform_write+0x6f8/0x990 mm/filemap.c:4070
 ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299
 ext4_file_write_iter+0x892/0x1c50
 iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743
 do_splice_from fs/splice.c:941 [inline]
 direct_splice_actor+0x11d/0x220 fs/splice.c:1164
 splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108
 do_splice_direct_actor fs/splice.c:1207 [inline]
 do_splice_direct+0x289/0x3e0 fs/splice.c:1233
 do_sendfile+0x564/0x8a0 fs/read_write.c:1363
 __do_sys_sendfile64 fs/read_write.c:1424 [inline]
 __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5c6bb18d09
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09
RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004
RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000
R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620
R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

This happens because ext4_page_mkwrite is not protected by the inode_lock.
The xattr semaphore is not sufficient to protect inline data handling in a
sane way, so we need to rely on the inode_lock. Adding the inode_lock to
ext4_page_mkwrite is not an option, otherwise lock-ordering problems with
mmap_lock may arise.

The conversion inside ext4_page_mkwrite was introduced at commit
7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
fixes a documented bug in the commit message, which suggests some
alternative fixes.

Convert inline data when mmap is called, instead of doing it only when the
mmapped page is written to. Using the inode_lock there does not lead to
lock-ordering issues.

The drawback is that inline conversion will happen when the file is
mmapped, even though the page will not be written to.

Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Cc: stable@vger.kernel.org
---
Changes in v2:
- Convert inline data at mmap time, avoiding data loss.
- Link to v1: https://lore.kernel.org/r/20250519-ext4_inline_page_mkwrite-v1-1-865d9a62b512@igalia.com
---
 fs/ext4/file.c  | 6 ++++++
 fs/ext4/inode.c | 4 ----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index beb078ee4811d6092e362e37307e7d87e5276cbc..f2380471df5d99500e49fdc639fa3e56143c328f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -819,6 +819,12 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
+	inode_lock(inode);
+	ret = ext4_convert_inline_data(inode);
+	inode_unlock(inode);
+	if (ret)
+		return ret;
+
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94c7d2d828a64e42ded09c82497ed7617071aa19..895ecda786194b29d32c9c49785d56a1a84e2096 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6222,10 +6222,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 
 	filemap_invalidate_lock_shared(mapping);
 
-	err = ext4_convert_inline_data(inode);
-	if (err)
-		goto out_ret;
-
 	/*
 	 * On data journalling we skip straight to the transaction handle:
 	 * there's no delalloc; page truncated will be checked later; the

---
base-commit: 4a95bc121ccdaee04c4d72f84dbfa6b880a514b6
change-id: 20250519-ext4_inline_page_mkwrite-c42ca1f02295

Best regards,
-- 
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] ext4: inline: convert when mmap is called, not when page is written
  2025-05-26 14:13 [PATCH v2] ext4: inline: convert when mmap is called, not when page is written Thadeu Lima de Souza Cascardo
@ 2025-05-26 16:20 ` Jan Kara
  2025-07-11  4:52   ` Theodore Ts'o
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2025-05-26 16:20 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Theodore Ts'o, Jan Kara, Tao Ma, Andreas Dilger, Eric Biggers,
	linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
	syzbot+0c89d865531d053abb2d, stable

On Mon 26-05-25 11:13:34, Thadeu Lima de Souza Cascardo wrote:
> inline data handling has a race between writing and writing to a memory
> map.
> 
> When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> destroys the inline data, but if block allocation fails, restores the
> inline data. In that process, we could have:
> 
> CPU1					CPU2
> destroy_inline_data
> 					write_begin (does not see inline data)
> restory_inline_data
> 					write_end (sees inline data)
> 
> This leads to bugs like the one below, as write_begin did not prepare for
> the case of inline data, which is expected by the write_end side of it.
> 
> ------------[ cut here ]------------
> kernel BUG at fs/ext4/inline.c:235!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline]
> RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774
> Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
> RSP: 0018:ffffc900031c7320 EFLAGS: 00010293
> RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000
> RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b
> RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070
> R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a
> R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b
> FS:  00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  generic_perform_write+0x6f8/0x990 mm/filemap.c:4070
>  ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299
>  ext4_file_write_iter+0x892/0x1c50
>  iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743
>  do_splice_from fs/splice.c:941 [inline]
>  direct_splice_actor+0x11d/0x220 fs/splice.c:1164
>  splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108
>  do_splice_direct_actor fs/splice.c:1207 [inline]
>  do_splice_direct+0x289/0x3e0 fs/splice.c:1233
>  do_sendfile+0x564/0x8a0 fs/read_write.c:1363
>  __do_sys_sendfile64 fs/read_write.c:1424 [inline]
>  __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f5c6bb18d09
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
> RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09
> RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004
> RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000
> R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620
> R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192
>  </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> 
> This happens because ext4_page_mkwrite is not protected by the inode_lock.
> The xattr semaphore is not sufficient to protect inline data handling in a
> sane way, so we need to rely on the inode_lock. Adding the inode_lock to
> ext4_page_mkwrite is not an option, otherwise lock-ordering problems with
> mmap_lock may arise.
> 
> The conversion inside ext4_page_mkwrite was introduced at commit
> 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> fixes a documented bug in the commit message, which suggests some
> alternative fixes.
> 
> Convert inline data when mmap is called, instead of doing it only when the
> mmapped page is written to. Using the inode_lock there does not lead to
> lock-ordering issues.
> 
> The drawback is that inline conversion will happen when the file is
> mmapped, even though the page will not be written to.
> 
> Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
> Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
> Cc: stable@vger.kernel.org
> ---
> Changes in v2:
> - Convert inline data at mmap time, avoiding data loss.
> - Link to v1: https://lore.kernel.org/r/20250519-ext4_inline_page_mkwrite-v1-1-865d9a62b512@igalia.com
> ---
>  fs/ext4/file.c  | 6 ++++++
>  fs/ext4/inode.c | 4 ----
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index beb078ee4811d6092e362e37307e7d87e5276cbc..f2380471df5d99500e49fdc639fa3e56143c328f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -819,6 +819,12 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (!daxdev_mapping_supported(vma, dax_dev))
>  		return -EOPNOTSUPP;
>  
> +	inode_lock(inode);
> +	ret = ext4_convert_inline_data(inode);
> +	inode_unlock(inode);
> +	if (ret)
> +		return ret;
> +
>  	file_accessed(file);
>  	if (IS_DAX(file_inode(file))) {
>  		vma->vm_ops = &ext4_dax_vm_ops;

So I would *love* to do this and was thinking about this as well. But the
trouble is that this causes lock inversion as well because ->mmap callback
is called with mmap_lock held and so we cannot acquire inode_lock here
either.

Recent changes which switch from ->mmap to ->mmap_prepare callback are
actually going in a suitable direction but we'd need a rather larger
rewrite to get from under mmap_lock and I'm not sure that's justified.

Anyway, thanks for having a look!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] ext4: inline: convert when mmap is called, not when page is written
  2025-05-26 16:20 ` Jan Kara
@ 2025-07-11  4:52   ` Theodore Ts'o
  0 siblings, 0 replies; 3+ messages in thread
From: Theodore Ts'o @ 2025-07-11  4:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Thadeu Lima de Souza Cascardo, Jan Kara, Tao Ma, Andreas Dilger,
	Eric Biggers, linux-ext4, linux-fsdevel, linux-kernel, kernel-dev,
	syzbot+0c89d865531d053abb2d, stable

On Mon, May 26, 2025 at 06:20:11PM +0200, Jan Kara wrote:
> 
> So I would *love* to do this and was thinking about this as well. But the
> trouble is that this causes lock inversion as well because ->mmap callback
> is called with mmap_lock held and so we cannot acquire inode_lock here
> either.
> 
> Recent changes which switch from ->mmap to ->mmap_prepare callback are
> actually going in a suitable direction but we'd need a rather larger
> rewrite to get from under mmap_lock and I'm not sure that's justified.

This came up in discussions at this week's ext4 video conference, and
afterwards, I started thinking.  I suspect the best solution is one
where we avoid using write_begin/write_end calls entirely.  Instead,
we allocate a single block using ext4_mb_new_block() (see what
ext4_new_meta_blocks does for an example); then we write the contents
into the newly allocated blocks; and if the allocate and the write are
successful, only then do we start a jbd2 handle, to zero the
i_blocks[] array and then set logical block 0 to point at the newly
allocated data block, and then close the handle.

Splitting the block allocation from the "update the logical->physical
mapping", which is currently done in ext4_map_blocks(), into two
separate operations has been one of the things I had been wanting to
do for a while, since it would allow us to replace the current
dioread_nolock buffered write pqth, where we call ext4_map_blocks() to
allocate the blocks but mark the blocks as uninitialized, and then we
write the data blocks, and then we do a second ext4_map_blocks() call
to clear the uninitialized flag.

So it's a bit more involved since it requires refactoring parts of
ext4_map_blocks() to create a new function, ext4_set_blocks() which
updates the extent tree or indirect block map using block(s) that were
allocated separately.  But this would be useful for things besides
just the inline file conversion operation, so it's worth the effort.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-07-11  4:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-26 14:13 [PATCH v2] ext4: inline: convert when mmap is called, not when page is written Thadeu Lima de Souza Cascardo
2025-05-26 16:20 ` Jan Kara
2025-07-11  4:52   ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).